Preface


“We make ourselves no promises, but we cherish the hope that the unobstructed 
pursuit of useless knowledge will prove to have consequences in the future
as in the past” ... “An institution which sets free successive generations of
human souls is amply justified whether or not this graduate or that makes a 
so-called useful contribution to human knowledge. A poem, a symphony, a 
painting, a mathematical truth, a new scientific fact, all bear in themselves all 
the justification that universities, colleges, and institutes of research need or 
require”, Abraham Flexner, The Usefulness of Useless Knowledge, 1939. 


“I suggest that you take the hardest courses that you can, because you learn 
the most when you challenge yourself... CS 121 I found pretty hard.”, Mark 
Zuckerberg, 2005. 


This is a textbook for an undergraduate introductory course on 
theoretical computer science. The educational goals of this book are to 
convey the following: 


• That computation arises in a variety of natural and human-made
systems, and not only in modern silicon-based computers.

• Similarly, beyond being an extremely important tool, computation
also serves as a useful lens to describe natural, physical, mathematical
and even social concepts.

• The notion of universality of many different computational models,
and the related notion of the duality between code and data.

• The idea that one can precisely define a mathematical model of
computation, and then use that to prove (or sometimes only conjecture)
lower bounds and impossibility results.

• Some of the surprising results and discoveries in modern theoretical
computer science, including the prevalence of NP-completeness,
the power of interaction, the power of randomness on one hand
and the possibility of derandomization on the other, the ability
to use hardness “for good” in cryptography, and the fascinating
possibility of quantum computing.

I hope that after taking this course, students will be able to recognize
computation, with both its power and pitfalls, as it arises in
various settings, including seemingly “static” content or “restricted” 
formalisms such as macros and scripts. They should be able to follow 
through the logic of proofs about computation, including the cen- 
tral concept of a reduction, as well as understanding “self-referential” 
proofs (such as diagonalization-based proofs that involve programs 
given their own code as input). Students should understand that 
some problems are inherently intractable, and be able to recognize the 
potential for intractability when they are faced with a new problem. 
While this book only touches on cryptography, students should un- 
derstand the basic idea of how we can use computational hardness for 
cryptographic purposes. However, more than any specific skill, this 
book aims to introduce students to a new way of thinking of computa- 
tion as an object in its own right and to illustrate how this new way of 
thinking leads to far-reaching insights and applications. 

My aim in writing this text is to try to convey these concepts in the 
simplest possible way and try to make sure that the formal notation 
and model help elucidate, rather than obscure, the main ideas. I also
tried to take advantage of modern students’ familiarity with (or at least
interest in!) programming, and hence use (highly simplified) programming
languages to describe our models of computation. That
said, this book does not assume fluency with any particular program- 
ming language, but rather only some familiarity with the general 
notion of programming. We will use programming metaphors and 
idioms, occasionally mentioning specific programming languages 
such as Python, C, or Lisp, but students should be able to follow these 
descriptions even if they are not familiar with these languages. 

Proofs in this book, including the existence of a universal Turing 
Machine, the fact that every finite function can be computed by some 
circuit, the Cook-Levin theorem, and many others, are often con- 
structive and algorithmic, in the sense that they ultimately involve 
transforming one program to another. While it is possible to follow 
these proofs without seeing the code, I do think that having access 
to the code, and the ability to play around with it and see how it acts 
on various programs, can make these theorems more concrete for the 
students. To that end, an accompanying website (which is still a work 
in progress) allows executing programs in the various computational 
models we define, as well as seeing constructive proofs of some of the 
theorems. 


0.1 TO THE STUDENT 


This book can be challenging, mainly because it brings together a
variety of ideas and techniques in the study of computation. There
are quite a few technical hurdles to master, whether it is following
the diagonalization argument for proving that the Halting Problem is
undecidable, combinatorial gadgets in NP-completeness reductions,
analyzing probabilistic algorithms, or arguing about the adversary to
prove the security of cryptographic primitives.

The best way to engage with this material is to read these notes actively,
so make sure you have a pen ready. While reading, I encourage
you to stop and think about the following:

• When I state a theorem, stop and take a shot at proving it on your
own before reading the proof. You will be amazed by how much
better you can understand a proof even after only 5 minutes of
attempting it on your own.

• When reading a definition, make sure that you understand what
the definition means, and what the natural examples are of objects
that satisfy it and objects that do not. Try to think of the motivation
behind the definition, and whether there are other natural ways to
formalize the same concept.

• Actively notice which questions arise in your mind as you read the
text, and whether or not they are answered in the text.


As a general rule, it is more important that you understand the 
definitions than the theorems, and it is more important that you 
understand a theorem statement than its proof. After all, before you 
can prove a theorem, you need to understand what it states, and to 
understand what a theorem is about, you need to know the definitions 
of the objects involved. Whenever a proof of a theorem is at least 
somewhat complicated, I provide a “proof idea.” Feel free to skip the 
actual proof in a first reading, focusing only on the proof idea. 

This book contains some code snippets, but this is by no means 
a programming text. You don’t need to know how to program to 
follow this material. The reason we use code is that it is a precise way 
to describe computation. Particular implementation details are not 
as important to us, and so we will emphasize code readability at the 
expense of considerations such as error handling, encapsulation, etc. 
that can be extremely important for real-world programming. 


0.1.1 Is the effort worth it? 

This is not an easy book, and you might reasonably wonder why
you should spend the effort in learning this material. A traditional
justification for a “Theory of Computation” course is that you might
encounter these concepts later on in your career. Perhaps you will
come across a hard problem and realize it is NP-complete, or find a
need to use what you learned about regular expressions. This might
very well be true, but the main benefit of this book is not in teaching
you any practical tool or technique, but instead in giving you a differ- 
ent way of thinking: an ability to recognize computational phenomena 
even when they occur in non-obvious settings, a way to model compu- 
tational tasks and questions, and to reason about them. 

Regardless of any use you will derive from this book, I believe 
learning this material is important because it contains concepts that 
are both beautiful and fundamental. The role that energy and matter 
played in the 20th century is played in the 21st by computation and 
information, not just as tools for our technology and economy, but also 
as the basic building blocks we use to understand the world. This 
book will give you a taste of some of the theory behind those, and 
hopefully spark your curiosity to study more. 


0.2 TO POTENTIAL INSTRUCTORS 


I wrote this book for my Harvard course, but I hope that other lectur- 
ers will find it useful as well. To some extent, it is similar in content 
to “Theory of Computation” or “Great Ideas” courses such as those 
taught at CMU or MIT. 

The most significant difference between our approach and more 
traditional ones (such as Hopcroft and Ullman’s [HU69; HU79] and 
Sipser’s [Sip97]) is that we do not start with finite automata as our initial
computational model. Instead, our initial computational model
is Boolean Circuits.¹ We believe that Boolean Circuits are more fundamental
to the theory of computing (and even its practice!) than
automata. In particular, Boolean Circuits are a prerequisite for many
concepts that one would want to teach in a modern course on theoretical
computer science, including cryptography, quantum computing,
derandomization, attempts at proving P ≠ NP, and more. Even in
cases where Boolean Circuits are not strictly required, they can often
offer significant simplifications (as in the case of the proof of the
Cook-Levin Theorem).

Furthermore, I believe there are pedagogical reasons to start with 
Boolean circuits as opposed to finite automata. Boolean circuits are a 
more natural model of computation, and one that corresponds more 
closely to computing in silicon, making the connection to practice 
more immediate to the students. Finite functions are arguably easier 
to grasp than infinite ones, as we can fully write down their truth ta- 
ble. The theorem that every finite function can be computed by some 
Boolean circuit is both simple enough and important enough to serve 
as an excellent starting point for this course. Moreover, many of the 
main conceptual points of the theory of computation, including the 
notions of the duality between code and data, and the idea of universal- 
ity, can already be seen in this context. 


1 An earlier book that starts with circuits as the initial 
model is John Savage's [Sav98]. 


After Boolean circuits, we move on to Turing machines and prove 
results such as the existence of a universal Turing machine, the un- 
computability of the halting problem, and Rice’s Theorem. Automata 
are discussed after we see Turing machines and undecidability, as an
example of a restricted computational model where problems such as
determining halting can be effectively solved. 

While this is not our motivation, the order in which we present circuits,
Turing machines, and automata roughly corresponds to the chronological
order of their discovery. Boolean algebra goes back to Boole’s and
DeMorgan’s works in the 1840s [Boo47; De 47] (though the defini- 
tion of Boolean circuits and the connection to physical computation 
was given 90 years later by Shannon [Sha38]). Alan Turing defined 
what we now call “Turing Machines” in the 1930s [Tur37], while finite 
automata were introduced in the 1943 work of McCulloch and Pitts 
[MP43] but only really understood in the seminal 1959 work of Rabin 
and Scott [RS59]. 

More importantly, while models such as finite-state machines, reg- 
ular expressions, and context-free grammars are incredibly important 
for practice, the main applications for these models (whether it is for 
parsing, for analyzing properties such as liveness and safety, or even for 
software-defined routing tables) rely crucially on the fact that these 
are tractable models for which we can effectively answer semantic ques- 
tions. This practical motivation can be better appreciated after students 
see the undecidability of semantic properties of general computing 
models. 

The fact that we start with circuits makes proving the Cook-Levin 
Theorem much easier. In fact, our proof of this theorem can be (and 
is) done using a handful of lines of Python. Combining this proof 
with the standard reductions (which are also implemented in Python) 
allows students to appreciate visually how a question about computa- 
tion can be mapped into a question about (for example) the existence 
of an independent set in a graph. 

Some other differences between this book and previous texts are 
the following: 


1. For measuring time complexity, we use the standard RAM machine
model used (implicitly) in algorithms courses, rather than Turing
machines. While these two models are of course polynomially
equivalent, and hence make no difference for the definitions of the
classes P, NP, and EXP, our choice makes the distinction between
notions such as $O(n)$ or $O(n^2)$ time more meaningful. This choice
also ensures that these finer-grained time complexity classes correspond
to the informal definitions of linear and quadratic time that
students encounter in their algorithms lectures (or their whiteboard
coding interviews...).


2. We use the terminology of functions rather than languages. That is,
rather than saying that a Turing Machine $M$ decides a language $L \subseteq \{0,1\}^*$,
we say that it computes a function $F : \{0,1\}^* \rightarrow \{0,1\}$. The
terminology of “languages” arises from Chomsky’s work [Cho56], 
but it is often more confusing than illuminating. The language 
terminology also makes it cumbersome to discuss concepts such 
as algorithms that compute functions with more than one bit of 
output (including basic tasks such as addition, multiplication, 
etc...). The fact that we use functions rather than languages means 
we have to be extra vigilant about students distinguishing between 
the specification of a computational task (e.g., the function) and its 
implementation (e.g., the program). On the other hand, this point is 
so important that it is worth repeatedly emphasizing and drilling 
into the students, regardless of the notation used. The book does
mention the language terminology and reminds the reader of it occasionally,
to make it easier for students to consult outside resources.


Reducing the time dedicated to finite automata and context-free 
languages allows instructors to spend more time on topics that a mod- 
ern course in the theory of computing needs to touch upon. These 
include randomness and computation, the interactions between proofs 
and programs (including Gödel’s incompleteness theorem, interactive
proof systems, and even a bit on the λ-calculus and the Curry-Howard
correspondence), cryptography, and quantum computing. 

This book contains sufficient detail to enable its use for self-study. 
Toward that end, every chapter starts with a list of learning objectives, 
ends with a recap, and is peppered with “pause boxes” which encour- 
age students to stop and work out an argument or make sure they 
understand a definition before continuing further. 

Section 0.5 contains a “roadmap” for this book, with descriptions 
of the different chapters, as well as the dependency structure between 
them. This can help in planning a course based on this book. 
0
Introduction 


“Computer Science is no more about computers than astronomy is about 
telescopes”, attributed to Edsger Dijkstra.¹


“Hackers need to understand the theory of computation about as much as 
painters need to understand paint chemistry.”, Paul Graham 2003.²


“The subject of my talk is perhaps most directly indicated by simply asking
two questions: first, is it harder to multiply than to add? and second, why?...I
(would like to) show that there is no algorithm for multiplication computationally
as simple as that for addition, and this proves something of a stumbling
block.”, Alan Cobham, 1964


One of the ancient Babylonians’ greatest innovations is the place- 
value number system. The place-value system represents numbers as 
sequences of digits where the position of each digit determines its 
value. 

This is opposed to a system like Roman numerals, where every 
digit has a fixed value regardless of position. For example, the aver- 
age distance to the moon is approximately 259,956 Roman miles. In 
standard Roman numerals, that would be 


MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMMMMMMMMMMMM 
MMMMMMMMMMMMMMMMMMMDCCCCLVI 


Writing the distance to the sun in Roman numerals would require 
about 100,000 symbols; it would take a 50-page book to contain this 
single number! 
Learning Objectives:

• Introduce and motivate the study of computation for its own sake,
irrespective of particular implementations.

• The notion of an algorithm and some of its history.

• Algorithms as not just tools, but also ways of thinking and understanding.

• Taste of Big-O analysis and the surprising creativity in the design of
efficient algorithms.


1 This quote is typically read as disparaging the 
importance of actual physical computers in Computer 
Science, but note that telescopes are absolutely 
essential to astronomy as they provide us with the 
means to connect theoretical predictions with actual 
experimental observations. 

2 To be fair, in the following sentence Graham says
“you need to know how to calculate time and space 
complexity and about Turing completeness”. This 
book includes these topics, as well as others such as 
NP-hardness, randomization, cryptography, quantum 
computing, and more. 
For someone who thinks of numbers in an additive system like
Roman numerals, quantities like the distance to the moon or sun are 
not merely large—they are unspeakable: they cannot be expressed or 
even grasped. It’s no wonder that Eratosthenes, the first to calculate 
the earth’s diameter (up to about ten percent error), and Hipparchus, 
the first to calculate the distance to the moon, used not a Roman- 
numeral type system but the Babylonian sexagesimal (base 60) place- 
value system. 


0.1 INTEGER MULTIPLICATION: AN EXAMPLE OF AN ALGORITHM 


In the language of Computer Science, the place-value system for representing
numbers is known as a data structure: a set of instructions,
or “recipe”, for representing objects as symbols. An algorithm is a set
of instructions, or “recipe”, for performing operations on such representations.
Data structures and algorithms have enabled amazing
applications that have transformed human society, but their importance
goes beyond their practical utility. Structures from computer
science, such as bits, strings, graphs, and even the notion of a program
itself, as well as concepts such as universality and replication, have not
just found (many) practical uses but contributed a new language and
a new way to view the world.

In addition to coming up with the place-value system, the Babylo- 
nians also invented the “standard algorithms” that we were all taught 
in elementary school for adding and multiplying numbers. These al- 
gorithms have been essential throughout the ages for people using 
abaci, papyrus, or pencil and paper, but in our computer age, do they 
still serve any purpose beyond torturing third-graders? To see why 
these algorithms are still very much relevant, let us compare the Baby- 
lonian digit-by-digit multiplication algorithm (“grade-school multi- 
plication”) with the naive algorithm that multiplies numbers through 
repeated addition. We start by formally describing both algorithms, 
see Algorithm 0.1 and Algorithm 0.2. 


Algorithm 0.1: Multiplication via repeated addition.
Input: Non-negative integers x, y
Output: Product x · y

    Let result ← 0.
    for i = 1, ..., y do
        result ← result + x
    end for
    return result


Both Algorithm 0.1 and Algorithm 0.2 assume that we already
know how to add numbers, and Algorithm 0.2 also assumes that we
can multiply a number by a power of 10 (which is, after all, a simple
shift). Suppose that $x$ and $y$ are two integers of $n = 20$ decimal
digits each. (This roughly corresponds to 64 binary digits, which is
a common size in many programming languages.) Computing $x \cdot y$
using Algorithm 0.1 entails adding $x$ to itself $y$ times, which entails
(since $y$ is a 20-digit number) at least $10^{19}$ additions. In contrast, the
grade-school algorithm (i.e., Algorithm 0.2) involves $n^2$ shifts and
single-digit products, and so at most $2n^2 = 800$ single-digit operations.
To understand the difference, consider that a grade-schooler can
perform a single-digit operation in about 2 seconds, and so would require
about 1,600 seconds (about half an hour) to compute $x \cdot y$ using
Algorithm 0.2. In contrast, even though it is more than a billion times
faster than a human, if we used Algorithm 0.1 to compute $x \cdot y$ using a
modern PC, it would take us $10^{20}/10^9 = 10^{11}$ seconds (which is more
than three millennia!) to compute the same result.
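
To make the comparison concrete, here is a minimal Python sketch of the two approaches that also counts the basic operations each one performs. The function names and the operation counters are illustrative assumptions, not part of the book's algorithms, and the grade-school version is only one plausible rendering of digit-by-digit multiplication.

```python
def multiply_repeated_addition(x: int, y: int) -> tuple[int, int]:
    """Multiply by adding x to itself y times.

    Returns (product, number of additions performed).
    """
    result = 0
    additions = 0
    for _ in range(y):
        result += x
        additions += 1
    return result, additions


def multiply_grade_school(x: int, y: int) -> tuple[int, int]:
    """Multiply digit by digit, shifting each partial product by a power of 10.

    Returns (product, number of single-digit products performed).
    """
    x_digits = [int(d) for d in reversed(str(x))]  # least significant digit first
    y_digits = [int(d) for d in reversed(str(y))]
    result = 0
    digit_products = 0
    for i, xd in enumerate(x_digits):
        for j, yd in enumerate(y_digits):
            # (xd * yd) is a single-digit product; 10 ** (i + j) is the shift.
            result += (xd * yd) * 10 ** (i + j)
            digit_products += 1
    return result, digit_products
```

For two 20-digit inputs, `multiply_grade_school` performs 20 · 20 = 400 single-digit products, whereas `multiply_repeated_addition` would loop once per unit of `y` and is only practical for small `y`.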

Computers have not made algorithms obsolete. On the contrary, 
the vast increase in our ability to measure, store, and communicate 
data has led to much higher demand for developing better and more 
sophisticated algorithms that empower us to make better decisions 
based on these data. We also see that to no small extent the notion of
algorithm is independent of the actual computing device that executes 
it. The digit-by-digit multiplication algorithm is vastly better than 
iterated addition, regardless of whether the technology we use to 
implement it is a silicon-based chip, or a third-grader with pen and 
paper. 

Theoretical computer science is concerned with the inherent proper- 
ties of algorithms and computation; namely, those properties that are 
independent of current technology. We ask some questions that were
already pondered by the Babylonians, such as “what is the best way to
multiply two numbers?”, but also questions that rely on cutting-edge
science such as “could we use the effects of quantum entanglement to
factor numbers faster?”.


0.2 EXTENDED EXAMPLE: A FASTER WAY TO MULTIPLY (OPTIONAL)


Once you think of the standard digit-by-digit multiplication algorithm,
it seems like the “obviously best” way to multiply numbers.
In 1960, the famous mathematician Andrey Kolmogorov organized
a seminar at Moscow State University in which he conjectured that
every algorithm for multiplying two $n$-digit numbers would require
a number of basic operations that is proportional to $n^2$ ($\Omega(n^2)$ operations,
using $O$-notation as defined in Chapter 1). In other words, Kolmogorov
conjectured that in any multiplication algorithm, doubling
the number of digits would quadruple the number of basic operations
required. A young student named Anatoly Karatsuba was in the audience,
and within a week he disproved Kolmogorov’s conjecture by
discovering an algorithm that requires only about $Cn^{1.6}$ operations
for some constant $C$. Such a number becomes much smaller than $n^2$
as $n$ grows and so for large $n$ Karatsuba’s algorithm is superior to the
grade-school one. (For example, Python’s implementation switches
from the grade-school algorithm to Karatsuba’s algorithm for numbers
that are 1000 bits or larger.) While the difference between an
$O(n^{1.6})$ and an $O(n^2)$ algorithm can sometimes be crucial in practice
(see Section 0.3 below), in this book we will mostly ignore such distinctions.
However, we describe Karatsuba’s algorithm below since it
is a good example of how algorithms can often be surprising, as well
as a demonstration of the analysis of algorithms, which is central to this
book and to theoretical computer science at large.

Karatsuba’s algorithm is based on a faster way to multiply two-digit
numbers. Suppose that $x, y \in [100] = \{0, \ldots, 99\}$ are a pair of two-digit
numbers. Let’s write $\overline{x}$ for the “tens” digit of $x$, and $\underline{x}$ for the
“ones” digit, so that $x = 10\overline{x} + \underline{x}$, and write similarly $y = 10\overline{y} + \underline{y}$ for
$\overline{x}, \underline{x}, \overline{y}, \underline{y} \in [10]$. The grade-school algorithm for multiplying $x$ and $y$ is
illustrated in Fig. 1.

The grade-school algorithm can be thought of as transforming the
task of multiplying a pair of two-digit numbers into four single-digit
multiplications via the formula

$$(10\overline{x} + \underline{x}) \times (10\overline{y} + \underline{y}) = 100\,\overline{x}\,\overline{y} + 10(\overline{x}\,\underline{y} + \underline{x}\,\overline{y}) + \underline{x}\,\underline{y} \qquad (1)$$

Generally, in the grade-school algorithm doubling the number of
digits in the input results in quadrupling the number of operations,
leading to an $O(n^2)$ time algorithm. In contrast, Karatsuba’s algorithm
is based on the observation that we can express Eq. (1) also
as

$$(10\overline{x} + \underline{x}) \times (10\overline{y} + \underline{y}) = (100 - 10)\,\overline{x}\,\overline{y} + 10\left[(\overline{x} + \underline{x})(\overline{y} + \underline{y})\right] - (10 - 1)\,\underline{x}\,\underline{y} \qquad (2)$$

which reduces multiplying the two-digit numbers $x$ and $y$ to computing
the following three simpler products: $\overline{x}\,\overline{y}$, $\underline{x}\,\underline{y}$ and $(\overline{x} + \underline{x})(\overline{y} + \underline{y})$.
By repeating the same strategy recursively, we can reduce the task of
multiplying two $n$-digit numbers to the task of multiplying three pairs
of $\lfloor n/2 \rfloor + 1$ digit numbers.³ Since every time we double the number of
digits we triple the number of operations, we will be able to multiply
numbers of $n = 2^\ell$ digits using about $3^\ell = n^{\log_2 3} \sim n^{1.585}$ operations.
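
As a sanity check on the two-digit identity in Eq. (2), the following short Python sketch computes the product of two two-digit numbers using only the three products named above. The function and variable names (`karatsuba_two_digit`, `x_hi`, `x_lo`, and so on) are illustrative assumptions rather than notation from the text.

```python
def karatsuba_two_digit(x: int, y: int) -> int:
    """Multiply x, y in {0, ..., 99} using three products, following Eq. (2)."""
    x_hi, x_lo = divmod(x, 10)  # tens digit and ones digit of x
    y_hi, y_lo = divmod(y, 10)  # tens digit and ones digit of y
    a = x_hi * y_hi                    # product of the tens digits
    b = (x_hi + x_lo) * (y_hi + y_lo)  # product of the digit sums
    c = x_lo * y_lo                    # product of the ones digits
    return (100 - 10) * a + 10 * b - (10 - 1) * c
```

Note that the factors of `b` can be as large as 18, so this third product is not strictly a single-digit multiplication; this is why the recursive version works with numbers of $\lfloor n/2 \rfloor + 1$ digits rather than exactly $n/2$.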

The above is the intuitive idea behind Karatsuba’s algorithm, but is 
not enough to fully specify it. A complete description of an algorithm 
entails a precise specification of its operations together with its analysis: 
Figure 1: The grade-school multiplication algorithm
illustrated for multiplying $x = 10\overline{x} + \underline{x}$ and
$y = 10\overline{y} + \underline{y}$. It uses the formula
$(10\overline{x} + \underline{x}) \times (10\overline{y} + \underline{y}) = 100\,\overline{x}\,\overline{y} + 10(\overline{x}\,\underline{y} + \underline{x}\,\overline{y}) + \underline{x}\,\underline{y}$.


3 If $x$ is a number then $\lfloor x \rfloor$ is the integer obtained by
rounding it down, see Section 1.7.
proof that the algorithm does in fact do what it’s supposed to do. The
operations of Karatsuba’s algorithm are detailed in Algorithm 0.4, 
while the analysis is given in Lemma 0.5 and Lemma 0.6. 


Algorithm 0.4 is only half of the full description of Karatsuba’s
algorithm. The other half is the analysis, which entails proving that (1)
Algorithm 0.4 indeed computes the multiplication operation and (2)
it does so using $O(n^{\log_2 3})$ operations. We now turn to showing both
facts:


Lemma 0.5 For every non-negative integers $x, y$, when given input $x, y$
Algorithm 0.4 will output $x \cdot y$.


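Proof. Let $n$ be the maximum number of digits of $x$ and $y$. We prove
the lemma by induction on $n$. The base case is $n \leq 4$ where the algorithm
returns $x \cdot y$ by definition. (It does not matter which algorithm
we use to multiply four-digit numbers; we can even use repeated
addition.) Otherwise, if $n > 4$, we define $m = \lfloor n/2 \rfloor$, and write
$x = 10^m \overline{x} + \underline{x}$ and $y = 10^m \overline{y} + \underline{y}$.

Plugging this into $x \cdot y$, we get

$$x \cdot y = 10^{2m}\,\overline{x}\,\overline{y} + 10^m(\overline{x}\,\underline{y} + \underline{x}\,\overline{y}) + \underline{x}\,\underline{y} . \qquad (3)$$

Rearranging the terms we see that

$$x \cdot y = 10^{2m}\,\overline{x}\,\overline{y} + 10^m\left[(\overline{x} + \underline{x})(\overline{y} + \underline{y}) - \underline{x}\,\underline{y} - \overline{x}\,\overline{y}\right] + \underline{x}\,\underline{y} . \qquad (4)$$

Since the numbers $\underline{x}, \overline{x}, \underline{y}, \overline{y}, \overline{x} + \underline{x}, \overline{y} + \underline{y}$ all have at most $m + 2 < n$ digits,
the induction hypothesis implies that the values $A, B, C$ computed
by the recursive calls will satisfy $A = \overline{x}\,\overline{y}$, $B = (\overline{x} + \underline{x})(\overline{y} + \underline{y})$ and
$C = \underline{x}\,\underline{y}$. Plugging this into (4) we see that $x \cdot y$ equals the value
$(10^{2m} - 10^m) \cdot A + 10^m \cdot B + (1 - 10^m) \cdot C$ computed by Algorithm 0.4.
∎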
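
The recursion analyzed in this proof can be sketched in Python as follows. This is a hedged reconstruction based only on the quantities named in the proof (the split point $m = \lfloor n/2 \rfloor$, the three recursive values $A, B, C$, and the final combination); the function name `karatsuba`, the use of decimal strings to count digits, and the use of built-in multiplication in the base case are assumptions made for illustration.

```python
def karatsuba(x: int, y: int) -> int:
    """Multiply non-negative integers with three recursive half-size products."""
    n = max(len(str(x)), len(str(y)))  # maximum number of decimal digits
    if n <= 4:
        # Base case: any multiplication method works for small inputs.
        return x * y
    m = n // 2
    x_hi, x_lo = divmod(x, 10 ** m)  # x = 10^m * x_hi + x_lo
    y_hi, y_lo = divmod(y, 10 ** m)  # y = 10^m * y_hi + y_lo
    a = karatsuba(x_hi, y_hi)                # A: product of the high parts
    b = karatsuba(x_hi + x_lo, y_hi + y_lo)  # B: product of the sums
    c = karatsuba(x_lo, y_lo)                # C: product of the low parts
    # Combine exactly as in the last step of the proof.
    return (10 ** (2 * m) - 10 ** m) * a + 10 ** m * b + (1 - 10 ** m) * c
```

Each recursive call receives numbers of at most $m + 2 < n$ digits, which is the same fact the induction uses, so the recursion always reaches the base case.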


Figure 2: Karatsuba’s multiplication algorithm illustrated
for multiplying $x = 10\overline{x} + \underline{x}$ and $y = 10\overline{y} + \underline{y}$.
We compute the three orange, green and purple products
$\underline{x}\,\underline{y}$, $\overline{x}\,\overline{y}$ and $(\overline{x} + \underline{x})(\overline{y} + \underline{y})$ and then add and
subtract them to obtain the result.
Figure 3: Running time of Karatsuba’s algorithm
vs. the grade-school algorithm, plotted against input length.
(Python implementation available online.) Note the existence of a “cutoff”
length, where for sufficiently large inputs Karatsuba
becomes more efficient than the grade-school
algorithm. The precise cutoff location varies by implementation
and platform details, but will always occur
eventually.
Lemma 0.6 If $x, y$ are integers of at most $n$ digits, Algorithm 0.4 will
take $O(n^{\log_2 3})$ operations on input $x, y$.


Proof. Fig. 4 illustrates the idea behind the proof, which we only
sketch here, leaving filling out the details as Exercise 0.4. The proof
is again by induction. We define $T(n)$ to be the maximum number of
steps that Algorithm 0.4 takes on inputs of length at most $n$. Since in
the base case $n \leq 4$, Algorithm 0.4 performs a constant number of computations,
we know that $T(4) \leq c$ for some constant $c$ and for $n > 4$, it
satisfies the recursive equation

$$T(n) \leq 3T(\lfloor n/2 \rfloor + 1) + c'n \qquad (5)$$

for some constant $c'$ (using the fact that addition can be done in $O(n)$
operations).

The recursive equation (5) solves to $O(n^{\log_2 3})$. The intuition behind
this is presented in Fig. 4, and this is also a consequence of the
so-called “Master Theorem” on recurrence relations. As mentioned
above, we leave completing the proof to the reader as Exercise 0.4.
∎
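
One way to build intuition for the recurrence (5) is to evaluate it numerically and compare it against $n^{\log_2 3}$. The sketch below takes the recurrence with equality and assumes, purely for illustration, that both constants equal 1; the names `T`, `ratio`, `C` and `C_PRIME` are not from the text. This is an experiment, not a substitute for the proof.

```python
import math
from functools import lru_cache

C = 1        # assumed cost of the base case: T(n) = C for n <= 4
C_PRIME = 1  # assumed constant in the linear term c' * n


@lru_cache(maxsize=None)
def T(n: int) -> int:
    """Evaluate recurrence (5) with equality in place of the inequality."""
    if n <= 4:
        return C
    return 3 * T(n // 2 + 1) + C_PRIME * n


def ratio(n: int) -> float:
    """T(n) divided by n^(log2 3); it stays bounded if T(n) = O(n^(log2 3))."""
    return T(n) / n ** math.log2(3)
```

Evaluating `ratio(2 ** k)` for growing `k` shows the quotient settling below a small constant, which is the behavior the lemma predicts.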


Figure 4: Karatsuba’s algorithm reduces an $n$-bit
multiplication to three $n/2$-bit multiplications,
which in turn are reduced to nine $n/4$-bit multiplications
and so on. We can represent the computational
cost of all these multiplications in a 3-ary
tree of depth $\log_2 n$, where at the root the extra cost
is $cn$ operations, at the first level the extra cost is
$c(n/2)$ operations, and at each of the $3^i$ nodes of
level $i$, the extra cost is $c(n/2^i)$. The total cost is
$cn \sum_{i=0}^{\log_2 n} (3/2)^i \leq 10cn^{\log_2 3}$ by the formula for
summing a geometric series.
(Figure labels: work per level; $\log_2 n$ levels; cost $O(1)$ per node at the bottom level.)


Karatsuba’s algorithm is by no means the end of the line for multiplication
algorithms. In the 1960s, Toom and Cook extended Karatsuba’s
ideas to get an $O(n^{\log_k(2k-1)})$ time multiplication algorithm for
every constant $k$. In 1971, Schönhage and Strassen got even better algorithms
using the Fast Fourier Transform; their idea was to somehow
treat integers as “signals” and do the multiplication more efficiently
by moving to the Fourier domain. (The Fourier transform is a central
tool in mathematics and engineering, used in a great many applications;
if you have not seen it yet, you are likely to encounter it at some
point in your studies.) In the years that followed, researchers kept improving
the algorithm, and only very recently Harvey and Van Der
Hoeven managed to obtain an $O(n \log n)$ time algorithm for multiplication
(though it only starts beating the Schönhage-Strassen algorithm
for truly astronomical numbers). Yet, despite all this progress, we
still don’t know whether or not there is an $O(n)$ time algorithm for
multiplying two $n$-digit numbers!




Remark 0.7 — Matrix Multiplication (advanced note). 
(This book contains many “advanced” or “optional” 
notes and sections. These may assume background 
that not every student has, and can be safely skipped 
over as none of the future parts depends on them.) 
Ideas similar to Karatsuba’s can be used to speed up 
matrix multiplications as well. Matrices are a powerful 
way to represent linear equations and operations, 
widely used in numerous applications of scientific 
computing, graphics, machine learning, and many
more.

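One of the basic operations one can do with two matrices is to multiply
them. For example, if
$x = \begin{pmatrix} x_{0,0} & x_{0,1} \\ x_{1,0} & x_{1,1} \end{pmatrix}$
and
$y = \begin{pmatrix} y_{0,0} & y_{0,1} \\ y_{1,0} & y_{1,1} \end{pmatrix}$,
then the product of $x$ and $y$ is the matrix

$\begin{pmatrix} x_{0,0}y_{0,0} + x_{0,1}y_{1,0} & x_{0,0}y_{0,1} + x_{0,1}y_{1,1} \\ x_{1,0}y_{0,0} + x_{1,1}y_{1,0} & x_{1,0}y_{0,1} + x_{1,1}y_{1,1} \end{pmatrix}$.

You can see that we can compute this matrix by eight products of numbers.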

Now suppose that $n$ is even and $x$ and $y$ are a pair of $n \times n$
matrices which we can think of as each composed of four
$(n/2) \times (n/2)$ blocks $x_{0,0}, x_{0,1}, x_{1,0}, x_{1,1}$ and
$y_{0,0}, y_{0,1}, y_{1,0}, y_{1,1}$. Then the formula for the matrix
product of $x$ and $y$ can be expressed in the same way as above, just
replacing products $x_{a,b}y_{c,d}$ with matrix products, and addition
with matrix addition. This means that we can use the formula above to give
an algorithm that doubles the dimension of the matrices at the expense of
increasing the number of operations by a factor of 8, which for
$n = 2^\ell$ results in $8^\ell = n^3$ operations.

In 1969 Volker Strassen noted that we can compute the product of a pair of
two-by-two matrices using only seven products of numbers by observing that
each entry of the matrix $xy$ can be computed by adding and subtracting
the following seven terms:

$t_1 = (x_{0,0} + x_{1,1})(y_{0,0} + y_{1,1})$,
$t_2 = (x_{1,0} + x_{1,1})y_{0,0}$,
$t_3 = x_{0,0}(y_{0,1} - y_{1,1})$,
$t_4 = x_{1,1}(y_{1,0} - y_{0,0})$,
$t_5 = (x_{0,0} + x_{0,1})y_{1,1}$,
$t_6 = (x_{1,0} - x_{0,0})(y_{0,0} + y_{0,1})$,
$t_7 = (x_{0,1} - x_{1,1})(y_{1,0} + y_{1,1})$.

Indeed, one can verify that

$xy = \begin{pmatrix} t_1 + t_4 - t_5 + t_7 & t_3 + t_5 \\ t_2 + t_4 & t_1 + t_3 - t_2 + t_6 \end{pmatrix}$.
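
The seven-product identity can be checked mechanically. The Python sketch below is only an illustration: it writes out the standard form of Strassen's seven terms for two-by-two matrices of numbers and compares the result with the usual eight-product formula on random integer inputs. The function names `direct_product` and `strassen_product` are assumptions made for this example.

```python
import random

def direct_product(x, y):
    """2x2 matrix product using the usual eight multiplications."""
    return [
        [x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]],
        [x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]],
    ]

def strassen_product(x, y):
    """2x2 matrix product using only seven multiplications."""
    t1 = (x[0][0] + x[1][1]) * (y[0][0] + y[1][1])
    t2 = (x[1][0] + x[1][1]) * y[0][0]
    t3 = x[0][0] * (y[0][1] - y[1][1])
    t4 = x[1][1] * (y[1][0] - y[0][0])
    t5 = (x[0][0] + x[0][1]) * y[1][1]
    t6 = (x[1][0] - x[0][0]) * (y[0][0] + y[0][1])
    t7 = (x[0][1] - x[1][1]) * (y[1][0] + y[1][1])
    return [
        [t1 + t4 - t5 + t7, t3 + t5],
        [t2 + t4, t1 + t3 - t2 + t6],
    ]

random.seed(0)  # fixed seed so the check is reproducible
for _ in range(1000):
    x = [[random.randint(-50, 50) for _ in range(2)] for _ in range(2)]
    y = [[random.randint(-50, 50) for _ in range(2)] for _ in range(2)]
    assert strassen_product(x, y) == direct_product(x, y)
```

Because the identity never uses commutativity of multiplication, the same seven terms also work when the entries are themselves matrix blocks. Applying the idea recursively to $(n/2) \times (n/2)$ blocks multiplies the work by 7 rather than 8 each time the dimension doubles, giving roughly $n^{\log_2 7}$ operations.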


0.3 ALGORITHMS BEYOND ARITHMETIC 


The quest for better algorithms is by no means restricted to arithmetic 
tasks such as adding, multiplying or solving equations. Many graph 
algorithms, including algorithms for finding paths, matchings, span- 
ning trees, cuts, and flows, have been discovered in the last several 
decades, and this is still an intensive area of research. (For example, 
the last few years saw many advances in algorithms for the maximum 
flow problem, borne out of unexpected connections with electrical cir- 
cuits and linear equation solvers.) These algorithms are being used 
not just for the “natural” applications of routing network traffic or 
GPS-based navigation, but also for applications as varied as drug dis- 
covery through searching for structures in gene-interaction graphs to 
computing risks from correlations in financial investments. 
Google was founded based on the PageRank algorithm, which is an
efficient algorithm to approximate the “principal eigenvector” of
(a dampened version of) the adjacency matrix of the web graph. The
Akamai company was founded based on a new data structure, known
as consistent hashing, for a hash table where buckets are stored at
different servers. The backpropagation algorithm, which computes partial
derivatives of a neural network in $O(n)$ instead of $O(n^2)$ time,
underlies many of the recent phenomenal successes of learning deep neural
networks. Algorithms for solving linear equations under sparsity
constraints, a concept known as compressed sensing, have been used
to drastically reduce the amount and quality of data needed to analyze
MRI images. This made a critical difference for MRI imaging of
cancer tumors in children, where previously doctors needed to use
anesthesia to suspend breath during the MRI exam, sometimes with
dire consequences.




Even for classical questions, studied through the ages, new dis- 
coveries are still being made. For example, for the question of de- 
termining whether a given integer is prime or composite, which has 
been studied since the days of Pythagoras, efficient probabilistic algo- 
rithms were only discovered in the 1970s, while the first deterministic 
polynomial-time algorithm was only found in 2002. For the related 
problem of actually finding the factors of a composite number, new 
algorithms were found in the 1980s, and (as we'll see later in this 
course) discoveries in the 1990s raised the tantalizing prospect of 
obtaining faster algorithms through the use of quantum mechanical 
effects. 

Despite all this progress, there are still many more questions than 
answers in the world of algorithms. For almost all natural prob- 
lems, we do not know whether the current algorithm is the “best”, 
or whether a significantly better one is still waiting to be discovered. 
As alluded to in Cobham’s opening quote for this chapter, even for 
the basic problem of multiplying numbers we have not yet answered 
the question of whether there is a multiplication algorithm that is as 
efficient as our algorithms for addition. But at least we now know the 
right way to ask it. 


0.4 ON THE IMPORTANCE OF NEGATIVE RESULTS 


Finding better algorithms for problems such as multiplication, solv- 
ing equations, graph problems, or fitting neural networks to data, is 
undoubtedly a worthwhile endeavor. But why is it important to prove 
that such algorithms don’t exist? One motivation is pure intellectual 
curiosity. Another reason to study impossibility results is that they 
correspond to the fundamental limits of our world. In other words, 
impossibility results are laws of nature. 

Here are some examples of impossibility results outside computer 
science (see Section 0.7 for more about these). In physics, the impos- 
sibility of building a perpetual motion machine corresponds to the law 
of conservation of energy. The impossibility of building a heat engine 
beating Carnot’s bound corresponds to the second law of thermody- 
namics, while the impossibility of faster-than-light information trans- 
mission is a cornerstone of special relativity. In mathematics, while we 
all learned the formula for solving quadratic equations in high school, 
the impossibility of generalizing this formula to equations of degree 
five or more gave birth to group theory. The impossibility of proving 
Euclid’s fifth axiom from the first four gave rise to non-Euclidean ge- 
ometries, which ended up crucial for the theory of general relativity. 

In an analogous way, impossibility results for computation correspond
to “computational laws of nature” that tell us about the fundamental
limits of any information processing apparatus, whether based on silicon,
neurons, or quantum particles. Moreover, computer scientists found
creative approaches to apply computational limitations to achieve certain
useful tasks. For example, much of modern Internet traffic is encrypted
using the RSA encryption scheme, the security of which relies on the
(conjectured) impossibility of efficiently factoring large integers. More
recently, the Bitcoin system uses a digital analog of the “gold standard”
where, instead of using a precious metal, new currency is obtained by
“mining” solutions for computationally difficult problems.


0.5 ROADMAP TO THE REST OF THIS BOOK 


Often, when we try to solve a computational problem, whether it is 
solving a system of linear equations, finding the top eigenvector of a 
matrix, or trying to rank Internet search results, it is enough to use the
“I know it when I see it” standard for describing algorithms. As long
as we find some way to solve the problem, we are happy and might
not care much about the exact mathematical model for our algorithm.
But when we want to answer a question such as “does there exist an 
algorithm to solve the problem P?” we need to be much more precise. 
In particular, we will need to (1) define exactly what it means to 
solve P, and (2) define exactly what an algorithm is. Even (1) can 
sometimes be non-trivial but (2) is particularly challenging; it is not 
at all clear how (and even whether) we can encompass all potential 
ways to design algorithms. We will consider several simple models of 
computation, and argue that, despite their simplicity, they do capture 
all “reasonable” approaches to achieve computing, including all those
that are currently used in modern computing devices. 

Once we have these formal models of computation, we can try 
to obtain impossibility results for computational tasks, showing that 
some problems cannot be solved (or perhaps cannot be solved within
the resources of our universe). Archimedes once said that given a 
fulcrum and a long enough lever, he could move the world. We will 
see how reductions allow us to leverage one hardness result into a 
slew of a great many others, illuminating the boundaries between 
the computable and uncomputable (or tractable and intractable) 
problems. 

Later in this book we will go back to examining our models of 
computation, and see how resources such as randomness or quantum 
entanglement could potentially change the power of our model. In 
the context of probabilistic algorithms, we will see a glimpse of how 
randomness has become an indispensable tool for understanding 
computation, information, and communication. We will also see how 
computational difficulty can be an asset rather than a hindrance, and 
be used for the “derandomization” of probabilistic algorithms. The 
same ideas also show up in cryptography, which has undergone not 
just a technological but also an intellectual revolution in the last few 
decades, much of it building on the foundations that we explore in 
this course. 

Theoretical Computer Science is a vast topic, branching out and 
touching upon many scientific and engineering disciplines. This book 
provides a very partial (and biased) sample of this area. More than 
anything, I hope I will manage to “infect” you with at least some of 
my love for this field, which is inspired and enriched by the connec- 
tion to practice, but is also deep and beautiful regardless of applica- 
tions. 


0.5.1 Dependencies between chapters 
This book is divided into the following parts, see Fig. 5. 


• Preliminaries: Introduction, mathematical background, and
representing objects as strings.


• Part I: Finite computation (Boolean circuits): Equivalence of
circuits and straight-line programs. Universal gate sets. Existence of a
circuit for every function, representing circuits as strings, universal
circuit, lower bound on circuit size using the counting argument.


• Part II: Uniform computation (Turing machines): Equivalence of
Turing machines and programs with loops. Equivalence of models
(including RAM machines, λ calculus, and cellular automata),
configurations of Turing machines, existence of a universal Turing
machine, uncomputable functions (including the Halting problem
and Rice’s Theorem), Gödel’s incompleteness theorem, restricted
computational models (regular and context free languages).


• Part III: Efficient computation: Definition of running time, time
hierarchy theorem, P and NP, P/poly, NP completeness and the
Cook-Levin Theorem, space bounded computation.


• Part IV: Randomized computation: Probability, randomized
algorithms, BPP, amplification, BPP ⊆ P/poly, pseudorandom generators
and derandomization.


• Part V: Advanced topics: Cryptography, proofs and algorithms
(interactive and zero knowledge proofs, Curry-Howard correspondence),
quantum computing.


Figure 5: The dependency structure of the different parts. Part I
introduces the model of Boolean circuits to study finite functions with an
emphasis on quantitative questions (how many gates to compute a function).
Part II introduces the model of Turing machines to study functions that
have unbounded input lengths with an emphasis on qualitative questions (is
this function computable or not). Much of Part II does not depend on
Part I, as Turing machines can be used as the first computational model.
Part III depends on both parts as it introduces a quantitative study of
functions with unbounded input length. The more advanced parts IV
(randomized computation) and V (advanced topics) rely on the material of
Parts I, II and III.


The book largely proceeds in linear order, with each chapter building
on the previous ones, with the following exceptions:


• The topics of λ calculus (Section 8.5 and Section 8.5), Gödel’s
incompleteness theorem (Chapter 11), Automata/regular expressions
and context-free grammars (Chapter 10), and space-bounded
computation (Chapter 17), are not used in the following chapters.
Hence you can choose whether to cover or skip any subset of them.


• Part II (Uniform Computation / Turing Machines) does not have
a strong dependency on Part I (Finite computation / Boolean circuits)
and it should be possible to teach them in the reverse order
with minor modification. Boolean circuits are used in Part III (efficient
computation) for results such as P ⊆ P/poly and the Cook-Levin
Theorem, as well as in Part IV (for BPP ⊆ P/poly and derandomization)
and Part V (specifically in cryptography and quantum
computing).


• All chapters in Part V (Advanced topics) are independent of one
another and can be covered in any order.


A course based on this book can use all of Parts I, II, and III (possibly
skipping over some or all of the λ calculus, Chapter 11, Chapter 10
or Chapter 17), and then either cover all or some of Part IV (randomized
computation), and add a “sprinkling” of advanced topics from
Part V based on student or instructor interest.


0.6 EXERCISES 


Exercise 0.1 Rank the significance of the following inventions in speed- 
ing up the multiplication of large (that is 100-digit or more) numbers. 
That is, use “back of the envelope” estimates to order them in terms of 
the speedup factor they offered over the previous state of affairs. 


a. Discovery of the grade-school digit by digit algorithm (improving 
upon repeated addition). 


b. Discovery of Karatsuba’s algorithm (improving upon the digit by 
digit algorithm). 


c. Invention of modern electronic computers (improving upon calcu- 
lations with pen and paper). 


Exercise 0.2 The 1977 Apple II personal computer had a processor
speed of 1.023 MHz or about $10^6$ operations per second. At the
time of this writing the world’s fastest supercomputer performs 93
“petaflops” ($10^{15}$ floating point operations per second) or about
$10^{18}$ basic steps per second. For each one of the following running
times (as a function of the input length $n$), compute for both computers
how large an input they could handle in a week of computation, if they run
an algorithm that has this running time:


a. $n$ operations.

b. $n^2$ operations.

c. $n \log n$ operations.

d. $2^n$ operations.

e. $n!$ operations.
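
One possible way to organize the back-of-the-envelope computation is sketched below in Python. It is an illustration only: the helper `largest_input` and the constant `SECONDS_PER_WEEK` are hypothetical names introduced here, and the helper assumes the cost function is nondecreasing in $n$. It finds the largest $n$ whose cost fits in a given budget of operations by exponential search followed by binary search.

```python
import math

SECONDS_PER_WEEK = 7 * 24 * 60 * 60

def largest_input(cost, budget):
    """Largest n >= 1 with cost(n) <= budget (0 if even n = 1 is too costly).

    Assumes cost is a nondecreasing function of n.
    """
    if cost(1) > budget:
        return 0
    hi = 1
    while cost(2 * hi) <= budget:  # exponential search for an upper bound
        hi *= 2
    lo, hi = hi, 2 * hi            # invariant: cost(lo) <= budget < cost(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Example: a linear-time algorithm at about 10**6 operations per second.
apple_budget = 10 ** 6 * SECONDS_PER_WEEK
print(largest_input(lambda n: n, apple_budget))  # 604800000000
```

The other running times can be plugged in the same way, for instance `lambda n: n * n`, `lambda n: n * math.log2(n)`, `lambda n: 2 ** n` or `math.factorial`, with the budget changed to match the machine.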


Exercise 0.3 — Usefulness of algorithmic non-existence. In this chapter we 
mentioned several companies that were founded based on the discov- 
ery of new algorithms. Can you give an example for a company that 
was founded based on the non-existence of an algorithm? See footnote 
for hint.⁴


Exercise 0.4 — Analysis of Karatsuba’s Algorithm. a. Suppose that
$T_1, T_2, T_3, \ldots$ is a sequence of numbers such that $T_2 \le 10$ and
for every $n$, $T_n \le 3T_{\lfloor n/2 \rfloor + 1} + Cn$ for some
$C \ge 1$. Prove that $T_n \le 20Cn^{\log_2 3}$ for every $n > 2$.⁵


b. Prove that the number of single-digit operations that Karatsuba’s
algorithm takes to multiply two $n$ digit numbers is at most
$1000n^{\log_2 3}$.


Exercise 0.5 Implement in the programming language of your
choice functions Gradeschool_multiply(x,y) and
Karatsuba_multiply(x,y) that take two arrays of digits x and y and return
an array representing the product of x and y (where x is identified
with the number x[0]+10*x[1]+100*x[2]+...) using the
grade-school algorithm and the Karatsuba algorithm respectively.
At what number of digits does the Karatsuba algorithm beat the
grade-school one?
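
As a possible starting point, the Python sketch below fixes the digit-array convention described in the exercise (least significant digit first) and implements only the grade-school half. The helper names `to_digits` and `from_digits`, and the lowercase spelling `gradeschool_multiply`, are assumptions made for this illustration; the Karatsuba version is left to the reader.

```python
def to_digits(value: int) -> list[int]:
    """Digit array with value = d[0] + 10*d[1] + 100*d[2] + ..."""
    if value == 0:
        return [0]
    digits = []
    while value > 0:
        digits.append(value % 10)
        value //= 10
    return digits

def from_digits(digits: list[int]) -> int:
    """Inverse of to_digits."""
    return sum(d * 10 ** i for i, d in enumerate(digits))

def gradeschool_multiply(x: list[int], y: list[int]) -> list[int]:
    """Digit-by-digit multiplication, about len(x) * len(y) digit operations."""
    result = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            total = result[i + j] + xi * yj + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(y)] += carry
    # Drop high-order zeros (stored at the end), keeping at least one digit.
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return result

print(from_digits(gradeschool_multiply(to_digits(1234), to_digits(5678))))  # 7006652
```

Comparing against Python's built-in integer multiplication through `from_digits` gives a simple correctness check that can be reused for a Karatsuba implementation before timing the two.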


Exercise 0.6 — Matrix Multiplication (optional, advanced). In this exercise, we
show that if for some ω > 2, we can write the product of two k × k
real-valued matrices A, B using at most $k^\omega$ multiplications, then we
can multiply two n × n matrices in roughly $n^\omega$ time for every large
enough n.

To make this precise, we need to make some notation that is
unfortunately somewhat cumbersome. Assume that there is some $k \in \mathbb{N}$
and $m \le k^\omega$ such that for every k × k matrices A, B, C such that
C = AB, we can write for every $i, j \in [k]$:

$C_{i,j} = \sum_{\ell=0}^{m-1} \alpha_{i,j}^{\ell} f_\ell(A) g_\ell(B)$

for some linear functions
$f_0, \ldots, f_{m-1}, g_0, \ldots, g_{m-1} : \mathbb{R}^{k^2} \to \mathbb{R}$ and
coefficients $\{\alpha_{i,j}^{\ell}\}_{i,j \in [k], \ell \in [m]}$. Prove that
under this assumption for every $\epsilon > 0$, if n is sufficiently large,
then there is an algorithm that computes the product of two n × n matrices
using at most $O(n^{\omega+\epsilon})$ arithmetic operations. See footnote
for hint.⁶
⁴ As we will see in Chapter 21, almost any
company relying on cryptography needs to assume
the non-existence of certain algorithms. In particular,
RSA Security was founded based on the security
of the RSA cryptosystem, which presumes the
non-existence of an efficient algorithm to compute the
prime factorization of large integers.


⁵ Hint: Use a proof by induction - suppose that this is
true for all n’s from 1 to m and prove that this is true
also for m + 1.


⁶ Start by showing this for the case that $n = k^t$ for
some natural number t, in which case you can do so
recursively by breaking the matrices into k × k blocks.
0.7 BIBLIOGRAPHICAL NOTES


For a brief overview of what we'll see in this book, you could do far 
worse than read Bernard Chazelle’s wonderful essay on the Algo- 
rithm as an Idiom of modern science. The book of Moore and Mertens 
[MM11] gives a wonderful and comprehensive overview of the theory 
of computation, including much of the content discussed in this chap- 
ter and the rest of this book. Aaronson’s book [Aar13] is another great 
read that touches upon many of the same themes. 

For more on the algorithms the Babylonians used, see Knuth’s 
paper and Neugebauer’s classic book. 

Many of the algorithms we mention in this chapter are covered 
in algorithms textbooks such as those by Cormen, Leiserson, Rivest, 
and Stein [Cor+09], Kleinberg and Tardos [KT06], and Dasgupta, Pa- 
padimitriou and Vazirani [DPV08], as well as Jeff Erickson’s textbook. 
Erickson’s book is freely available online and contains a great exposi- 
tion of recursive algorithms in general and Karatsuba’s algorithm in 
particular. 

The story of Karatsuba’s discovery of his multiplication algorithm
is recounted by him in [Kar95]. As mentioned above, further improvements
were made by Toom and Cook [Too63; Coo66], Schönhage and
Strassen [SS71], Fürer [Für07], and recently by Harvey and van der
Hoeven [HV19]; see this article for a nice overview. The last papers
crucially rely on the Fast Fourier transform algorithm. The fascinating
story of the (re)discovery of this algorithm by John Tukey in the context
of the cold war is recounted in [Coo87]. (We say re-discovery
because it later turned out that the algorithm dates back to Gauss
[HJB85].) The Fast Fourier Transform is covered in some of the books
mentioned below, and there are also online available lectures such as
Jeff Erickson’s. See also this popular article by David Austin. Fast
matrix multiplication was discovered by Strassen [Str69], and since then
this has been an active area of research. [Blä13] is a recommended
self-contained survey of this area.

The Backpropagation algorithm for fast differentiation of neural
networks was invented by Werbos [Wer74]. The PageRank algorithm was
invented by Larry Page and Sergey Brin [Pag+99]. It is closely related
to the HITS algorithm of Kleinberg [Kle99]. The Akamai company was
founded based on the consistent hashing data structure described in
[Kar+97]. Compressed sensing has a long history but two foundational
papers are [CRT06; Don06]. [Lus+08] gives a survey of applications
of compressed sensing to MRI; see also this popular article by
Ellenberg [Ell10]. The deterministic polynomial-time algorithm for testing
primality was given by Agrawal, Kayal, and Saxena [AKS04].
We alluded briefly to classical impossibility results in mathematics,
including the impossibility of proving Euclid’s fifth postulate from the 
other four, impossibility of trisecting an angle with a straightedge and 
compass and the impossibility of solving a quintic equation via rad- 
icals. A geometric proof of the impossibility of angle trisection (one 
of the three geometric problems of antiquity, going back to the an- 
cient Greeks) is given in this blog post of Tao. The book of Mario Livio 
[Liv05] covers some of the background and ideas behind these impos- 
sibility results. Some exciting recent research is focused on trying to 
use computational complexity to shed light on fundamental questions 
in physics such as understanding black holes and reconciling general 
relativity with quantum mechanics.
Learning Objectives:


• Recall basic mathematical notions such as
sets, functions, numbers, logical operators
and quantifiers, strings, and graphs.

• Rigorously define Big-O notation.

• Proofs by induction.

• Practice with reading mathematical
definitions, statements, and proofs.

• Transform an intuitive argument into a
rigorous proof.


1 Mathematical Background


“I found that every number, which may be expressed from one to ten, surpasses 
the preceding by one unit: afterwards the ten is doubled or tripled ... until
a hundred; then the hundred is doubled and tripled in the same manner as
the units and the tens ... and so forth to the utmost limit of numeration.”,
Muhammad ibn Mūsā al-Khwārizmī, 820, translation by Fredric Rosen,
1831.


In this chapter we review some of the mathematical concepts that 
we use in this book. These concepts are typically covered in courses 
or textbooks on “mathematics for computer science” or “discrete 
mathematics”; see the “Bibliographical Notes” section (Section 1.9) 
for several excellent resources on these topics that are freely available
online.

A mathematician’s apology. Some students might wonder why this 
book contains so much math. The reason is that mathematics is sim- 
ply a language for modeling concepts in a precise and unambiguous 
way. In this book we use math to model the concept of computation. 
For example, we will consider questions such as “is there an efficient 
algorithm to find the prime factors of a given integer?”. (We will see that 
this question is particularly interesting, touching on areas as far apart 
as Internet security and quantum mechanics!) To even phrase such a
question, we need to give a precise definition of the notion of an
algorithm, and of what it means for an algorithm to be efficient. Also,
since there is no empirical experiment to prove the nonexistence of an
algorithm, the only way to establish such a result is using a mathematical
proof.


1.1 THIS CHAPTER: A READER’S MANUAL 


Depending on your background, you can approach this chapter in two 
different ways: 




• If you have already taken “discrete mathematics”, “mathematics
for computer science” or similar courses, you do not need to read
the whole chapter. You can just take a quick look at Section 1.2 to
see the main tools we will use, Section 1.7 for our notation and
conventions, and then skip ahead to the rest of this book. Alternatively,
you can sit back, relax, and read this chapter just to get familiar
with our notation, as well as to enjoy (or not) my philosophical
musings and attempts at humor.


• If your background is less extensive, see Section 1.9 for some
resources on these topics. This chapter briefly covers the concepts
that we need, but you may find it helpful to see a more in-depth
treatment. As usual with math, the best way to get comfortable
with this material is to work out exercises on your own.


You might also want to start brushing up on discrete probability, 
which we'll use later in this book (see Chapter 18). 


1.2 A QUICK OVERVIEW OF MATHEMATICAL PREREQUISITES 


The main mathematical concepts we will use are the following. We
just list these notions below, deferring their definitions to the rest of
this chapter. If you are familiar with all of these, then you might want
to just skip to Section 1.7 to see the full list of notation we use.


• Proofs: First and foremost, this book involves a heavy dose of
formal mathematical reasoning, which includes mathematical definitions,
statements, and proofs.


• Sets and set operations: We will make extensive use of mathematical
sets. We use the basic set relations of membership (∈) and containment
(⊆), and set operations, principally union (∪), intersection (∩), and
set difference (∖).


• Cartesian product and Kleene star operation: We also use the
Cartesian product of two sets A and B, denoted as A × B (that is,
A × B is the set of pairs (a,b) where a ∈ A and b ∈ B). We denote by
$A^n$ the n-fold Cartesian product (e.g., $A^3 = A \times A \times A$) and
by $A^*$ (known as the Kleene star) the union of $A^n$ for all
n ∈ {0, 1, 2, ...}.


• Functions: The domain and codomain of a function, properties such
as being one-to-one (also known as injective) or onto (also known
as surjective) functions, as well as partial functions (that, unlike
standard or “total” functions, are not necessarily defined on all
elements of their domain).


• Logical operations: The operations AND (∧), OR (∨), and NOT
(¬) and the quantifiers “there exists” (∃) and “for all” (∀).


• Basic combinatorics: Notions such as $\binom{n}{k}$ (the number of
k-sized subsets of a set of size n).


• Graphs: Undirected and directed graphs, connectivity, paths, and
cycles.


• Big-O notation: O, o, Ω, ω, Θ notation for analyzing asymptotic
growth of functions.


• Discrete probability: We will use probability theory, and specifically
probability over finite sample spaces such as tossing n coins,
including notions such as random variables, expectation, and
concentration. We will only use probability theory in the second half of
this text, and will review it beforehand in Chapter 18. However,
probabilistic reasoning is a subtle (and extremely useful!) skill, and
it’s always good to start early in acquiring it.
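
Several of the notions listed above have direct finite counterparts in Python, which can help make the notation concrete. The sketch below is only an illustration under stated assumptions: the sets `A` and `B` and the helpers `power` and `kleene_star_up_to` are hypothetical names, and since the Kleene star of a nonempty set is infinite, only the strings up to a chosen length are enumerated.

```python
from itertools import combinations, product
from math import comb

A = {0, 1}
B = {"a", "b", "c"}

# Cartesian product A x B: the set of all pairs (a, b) with a in A, b in B.
pairs = set(product(A, B))
print(len(pairs))  # 6

def power(S, n):
    """The n-fold Cartesian product S^n, as a set of length-n tuples."""
    return set(product(S, repeat=n))

def kleene_star_up_to(S, max_len):
    """Finite slice of S^*: the union of S^n for n = 0, 1, ..., max_len."""
    return set().union(*(power(S, n) for n in range(max_len + 1)))

print(len(power(A, 3)))              # 8
print(len(kleene_star_up_to(A, 3)))  # 15

# Binomial coefficient: the number of k-sized subsets of a set of size n.
assert comb(5, 2) == len(list(combinations(range(5), 2)))

# Quantifiers over a finite set: "for all" is all(), "there exists" is any().
assert all(x in {0, 1} for x in A) and any(x == 1 for x in A)
```

Note that `power(S, 0)` contains exactly one element, the empty tuple, which matches the convention that $A^0$ consists of the empty string.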


In the rest of this chapter we briefly review the above notions. This 
is partially to remind the reader and reinforce material that might 
not be fresh in your mind, and partially to introduce our notation 
and conventions which might occasionally differ from those you’ve 
encountered before. 


1.3 READING MATHEMATICAL TEXTS 


Mathematicians use jargon for the same reason that it is used in many
other professions such as engineering, law, medicine, and others. We
want to make terms precise and introduce shorthand for concepts
that are frequently reused. Mathematical texts tend to “pack a lot
of punch” per sentence, and so the key is to read them slowly and
carefully, parsing one symbol at a time.

With time and practice you will see that reading mathematical texts 
becomes easier and jargon is no longer an issue. Moreover, reading 
mathematical texts is one of the most transferable skills you could take 
from this book. Our world is changing rapidly, not just in the realm 
of technology, but also in many other human endeavors, whether it 
is medicine, economics, law or even culture. Whatever your future 
aspirations, it is likely that you will encounter texts that use new con- 
cepts that you have not seen before (see Fig. 1.1 and Fig. 1.2 for two 
recent examples from current “hot areas”). Being able to internalize 
and then apply new definitions can be hugely important. It is a skill 
that’s much easier to acquire in the relatively safe and stable context of 
a mathematical course, where one at least has the guarantee that the 
concepts are fully specified, and you have access to your teaching staff 
for questions. 

The basic components of a mathematical text are definitions, asser- 
tions and proofs. 
Figure 1.1: A snippet from the “methods” section of
the “AlphaGo Zero” paper by Silver et al., Nature, 2017.


Figure 1.2: A snippet from the “Zerocash” paper of
Ben-Sasson et al., which forms the basis of the
cryptocurrency startup Zcash.
1.3.1 Definitions

Mathematicians often define new concepts in terms of old concepts. 
For example, here is a mathematical definition which you may have 
encountered in the past (and will see again shortly): 


Definition 1.1 — One to one function. Let S, T be sets. We say that a
function f : S → T is one to one (also known as injective) if for every
two elements x, x′ ∈ S, if x ≠ x′ then f(x) ≠ f(x′).


Definition 1.1 captures a simple concept, but even so it uses quite 
a bit of notation. When reading such a definition, it is often useful to 
annotate it with a pen as you're going through it (see Fig. 1.3). For 
example, when you see an identifier such as f, S or x, make sure that 
you realize what sort of object it is: is it a set, a function, an element, 
a number, a gremlin? You might also find it useful to explain the 
definition in words to a friend (or to yourself). 
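
For a finite set S, Definition 1.1 can also be explained to a computer. The Python sketch below is an illustration, not part of the definition: the names `is_one_to_one` and `is_one_to_one_literal` are assumptions, and the first version additionally assumes that the values of f are hashable. The literal version mirrors the quantifiers in the definition, while the other uses the equivalent observation that f is one to one on S exactly when it takes |S| distinct values.

```python
def is_one_to_one_literal(f, S):
    """For every x, y in S: if x != y then f(x) != f(y)."""
    return all(f(x) != f(y) for x in S for y in S if x != y)

def is_one_to_one(f, S):
    """Equivalent check: f takes as many distinct values as S has elements."""
    images = [f(x) for x in S]
    return len(set(images)) == len(images)

S = {-1, 0, 1, 2}
print(is_one_to_one(lambda x: 2 * x, S))  # True
print(is_one_to_one(lambda x: x * x, S))  # False, since (-1)**2 == 1**2
```

Here `f` is an ordinary Python function and `S` a Python set, so each identifier has a definite type, which is exactly the kind of bookkeeping the annotation exercise above is meant to train.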


1.3.2 Assertions: Theorems, lemmas, claims 

Theorems, lemmas, claims and the like are true statements about the 
concepts we defined. Deciding whether to call a particular statement a 
“Theorem”, a “Lemma” or a “Claim” is a judgement call, and does not 
make a mathematical difference. All three correspond to statements 
which were proven to be true. The difference is that a Theorem refers to 
a significant result that we would want to remember and highlight. A 
Lemma often refers to a technical result that is not necessarily
important in its own right, but that can often be very useful in proving
other theorems. A Claim is a “throwaway” statement that we need to use
in order to prove some other bigger results, but do not care so much
about for its own sake.


1.3.3 Proofs 

Mathematical proofs are the arguments we use to demonstrate that our 
theorems, lemmas, and claims are indeed true. We discuss proofs in 
Section 1.5 below, but the main point is that the mathematical stan- 
dard of proof is very high. Unlike in some other realms, in mathe- 
matics a proof is an “airtight” argument that demonstrates that the 
statement is true beyond a shadow of a doubt. Some examples of
mathematical proofs are given in Solved Exercise 1.1 and
Section 1.6. As mentioned in the preface, as a general rule, it is more
important that you understand the definitions than the theorems, and it is
more important that you understand a theorem statement than its proof.


Figure 1.3: An annotated form of Definition 1.1,
marking which part is being defined and how. (The annotations label
the objects the concept applies to, the concept being defined, and the
condition defining the concept.)
1.4 BASIC DISCRETE MATH OBJECTS


In this section we quickly review some of the mathematical objects 
(the “basic data structures” of mathematics, if you will) we use in this 
book. 


1.4.1 Sets 
A set is an unordered collection of objects. For example, when we
write S = {2,4,7}, we mean that S denotes the set that contains the
numbers 2, 4, and 7. (We use the notation “2 ∈ S” to denote that 2 is
an element of S.) Note that the sets {2, 4, 7} and {7, 4, 2} are identical,
since they contain the same elements. Also, a set either contains an
element or does not contain it (there is no notion of containing it
“twice”), and so we could even write the same set S as {2, 2, 4, 7}
(though that would be a little weird). The cardinality of a finite set S,
denoted by |S|, is the number of elements it contains. (Cardinality can
be defined for infinite sets as well; see the sources in Section 1.9.) So,
in the example above, |S| = 3. A set S is a subset of a set T, denoted
by S ⊆ T, if every element of S is also an element of T. (We can
also describe this by saying that T is a superset of S.) For example,
{2,7} ⊆ {2,4,7}. The set that contains no elements is known as the
empty set and it is denoted by ∅. If A is a subset of B that is not equal
to B we say that A is a strict subset of B, and denote this by A ⊊ B.
We can define sets by either listing all their elements or by writing 
down a rule that they satisfy such as 


EVEN = {x | x = 2y for some non-negative integer y} .


Of course there is more than one way to write the same set, and of- 
ten we will use intuitive notation listing a few examples that illustrate 
the rule. For example, we can also define EVEN as 


EVEN = {0,2,4,...} . 


Note that a set can be either finite (such as the set {2, 4, 7}) or
infinite (such as the set EVEN). Also, the elements of a set don’t have
to be numbers. We can talk about sets such as the set {a, e, i, o, u}
of all the vowels in the English language, or the set {New York, Los
Angeles, Chicago, Houston, Philadelphia, Phoenix, San Antonio,
San Diego, Dallas} of all cities in the U.S. with population more than
one million per the 2010 census. A set can even have other sets as
elements, such as the set {∅, {1,2}, {2,3}, {1,3}} of all even-sized subsets
of {1, 2, 3}.


Operations on sets: The union of two sets S, T, denoted by S ∪ T,
is the set that contains all elements that are either in S or in T. The
intersection of S and T, denoted by S ∩ T, is the set of elements that are
both in S and in T. The set difference of S and T, denoted by S \ T (and
in some texts also by S - T), is the set of elements that are in S but not
in T.
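To make these notions concrete, here is a minimal sketch using Python's built-in set type. The choice of Python and all variable names are ours, for illustration only; the rule-based set is cut off at 10 so that it can be listed.

```python
# Sets from the text, written with Python's built-in set type.
S = {2, 4, 7}
T = {7, 4, 2, 2}           # order and repetition do not matter

same_set = (S == T)        # {2,4,7} and {7,4,2,2} denote the same set
cardinality = len(S)       # |S|
is_subset = {2, 7} <= S    # {2,7} is a subset of {2,4,7}
is_strict = {2, 7} < S     # ... and in fact a strict subset

A = {1, 2, 3}
B = {3, 4}
union = A | B              # elements that are in A or in B
intersection = A & B       # elements that are in both A and B
difference = A - B         # elements that are in A but not in B

# A set given by a rule, restricted to a finite range so it can be listed.
EVEN_BELOW_10 = {x for x in range(10) if x % 2 == 0}

print(same_set, cardinality, is_subset, is_strict)
print(union, intersection, difference, EVEN_BELOW_10)
```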


Tuples, lists, strings, sequences: A tuple is an ordered collection of items.
For example (1, 5, 2, 1) is a tuple with four elements (also known as
a 4-tuple or quadruple). Since order matters, this is not the same
tuple as the 4-tuple (1, 1, 5, 2) or the 3-tuple (1, 5, 2). A 2-tuple is also
known as a pair. We use the terms tuples and lists interchangeably.

A tuple where every element comes from some finite set Σ (such as
{0, 1}) is also known as a string. Analogously to sets, we denote the
length of a tuple T by |T|. Just like sets, we can also think of infinite
analogues of tuples, such as the ordered collection (1, 4, 9, ...) of all
perfect squares. Infinite ordered collections are known as sequences;
we might sometimes use the term “infinite sequence” to emphasize
this, and use “finite sequence” as a synonym for a tuple. (We can
identify a sequence (a_0, a_1, a_2, ...) of elements in some set S with a
function A : N → S (where a_n = A(n) for every n ∈ N). Similarly,
we can identify a k-tuple (a_0, ..., a_{k-1}) of elements in S with a function
A : [k] → S.)


Cartesian product: If S and T are sets, then their Cartesian product,
denoted by S × T, is the set of all ordered pairs (s, t) where s ∈ S and
t ∈ T. For example, if S = {1,2,3} and T = {10,12}, then S × T
contains the 6 elements (1, 10), (2, 10), (3, 10), (1, 12), (2, 12), (3, 12).
Similarly if S, T, U are sets then S × T × U is the set of all ordered
triples (s, t, u) where s ∈ S, t ∈ T, and u ∈ U. More generally, for
every positive integer n and sets S_0, ..., S_{n-1}, we denote by S_0 × S_1 ×
⋯ × S_{n-1} the set of ordered n-tuples (s_0, ..., s_{n-1}) where s_i ∈ S_i for
every i ∈ {0, ..., n - 1}. For every set S, we denote the set S × S by S^2,
S × S × S by S^3, S × S × S × S by S^4, and so on and so forth.
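As a small sanity check of the example above, the following sketch builds the Cartesian product with `itertools.product` from Python's standard library. The variable names are ours and purely illustrative.

```python
from itertools import product

S = {1, 2, 3}
T = {10, 12}

# S x T: all ordered pairs (s, t) with s in S and t in T.
S_times_T = set(product(S, T))

# S^3 = S x S x S, obtained with the repeat argument.
S_cubed = set(product(S, repeat=3))

print(sorted(S_times_T))
print(len(S_cubed))
```

Note that the number of pairs is |S| times |T|, and the number of triples in S^3 is |S| cubed.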


1.4.2 Special sets 
There are several sets that we will use in this book time and again. The 
set 


N = {0, 1,2, ...} 


contains all natural numbers, i.e., non-negative integers. For any natural
number n ∈ N, we define the set [n] as {0, ..., n - 1} = {k ∈ N :
k < n}. (We start our indexing of both N and [n] from 0, while many
other texts index those sets from 1. Starting from zero or one is simply 
a convention that doesn’t make much difference, as long as one is 
consistent about it.) 
We will also occasionally use the set Z = {..., -2, -1, 0, +1, +2, ...}
of (negative and non-negative) integers,¹ as well as the set R of real
numbers. (This is the set that includes not just the integers, but also
fractional and irrational numbers; e.g., R contains numbers such as
+0.5, -π, etc.) We denote by R_+ the set {x ∈ R : x > 0} of positive real
numbers. This set is sometimes also denoted as (0, ∞).

1 The letter Z stands for the German word “Zahlen”,
which means numbers.


Strings: Another set we will use time and again is 


{0,1}^n = { (x_0, ..., x_{n-1}) : x_0, ..., x_{n-1} ∈ {0,1} }

which is the set of all n-length binary strings for some natural number
n. That is, {0,1}^n is the set of all n-tuples of zeroes and ones. This is
consistent with our notation above: {0,1}^2 is the Cartesian product
{0,1} × {0,1}, {0,1}^3 is the product {0,1} × {0,1} × {0,1} and so on.
We will write the string (x_0, x_1, ..., x_{n-1}) as simply x_0 x_1 ⋯ x_{n-1}. For
example,


{0,1}^3 = {000, 001, 010, 011, 100, 101, 110, 111} .


For every string x ∈ {0,1}^n and i ∈ [n], we write x_i for the i-th
element of x.

We will also often talk about the set of binary strings of all lengths, 
which is 


{0,1}^* = { (x_0, ..., x_{n-1}) : n ∈ N, x_0, ..., x_{n-1} ∈ {0,1} } .

Another way to write this set is as

{0,1}^* = {0,1}^0 ∪ {0,1}^1 ∪ {0,1}^2 ∪ ⋯


or more concisely as


{0,1}^* = ∪_{n∈N} {0,1}^n .

The set {0,1}^* includes the “string of length 0” or “the empty
string”, which we will denote by "". (In using this notation we
follow the convention of many programming languages. Other texts
sometimes use ε or λ to denote the empty string.)


Generalizing the star operation: For every set Σ, we define


Σ^* = ∪_{n∈N} Σ^n .


For example, if Σ = {a, b, c, d, ..., z} then Σ^* denotes the set of all finite
length strings over the alphabet a-z.


Concatenation: The concatenation of two strings x ∈ Σ^n and y ∈ Σ^m is
the (n + m)-length string xy obtained by writing y after x. That is, if
x ∈ {0,1}^n and y ∈ {0,1}^m, then xy is equal to the string z ∈ {0,1}^{n+m}
such that for i ∈ [n], z_i = x_i and for i ∈ {n, ..., n + m - 1}, z_i = y_{i-n}.
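The definitions of n-length binary strings and of concatenation can be tried out directly. The sketch below is ours (the function names `binary_strings` and `concat` are hypothetical helpers, not notation from the text); `concat` follows the index-by-index definition rather than using Python's own `+` on strings, so that the two can be compared.

```python
from itertools import product

def binary_strings(n):
    """Return the list of all strings of length n over {0,1}, in lexicographic order."""
    return ["".join(bits) for bits in product("01", repeat=n)]

def concat(x, y):
    """Concatenation by the definition: z_i = x_i for i < n, and z_i = y_{i-n} otherwise."""
    n, m = len(x), len(y)
    z = [x[i] if i < n else y[i - n] for i in range(n + m)]
    return "".join(z)

print(binary_strings(3))
print(binary_strings(0))       # the only string of length 0 is the empty string ""
print(concat("01", "110"))
```

There are 2^n strings of length n, and the case n = 0 yields exactly one string, the empty string.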
1.4.3 Functions

If S and T are non-empty sets, a function F mapping S to T, denoted
by F : S → T, associates with every element x ∈ S an element
F(x) ∈ T. The set S is known as the domain of F and the set T
is known as the codomain of F. The image of a function F is the set
{F(x) | x ∈ S} which is the subset of F’s codomain consisting of all
output elements that are mapped from some input. (Some texts use
range to denote the image of a function, while other texts use range
to denote the codomain of a function. Hence we will avoid using the
term “range” altogether.) As in the case of sets, we can write a
function either by listing the table of all the values it gives for elements
in S or by using a rule. For example if S = {0,1,2,3,4,5,6,7,8,9}
and T = {0,1}, then the table below defines a function F : S → T.
Note that this function is the same as the function defined by the rule
F(x) = (x mod 2).²


Table 1.1: An example of a function. 


Input Output 


0 0 
1 1 
2 0 
3 1 
4 0 
5 1 
6 0 
7 1 
8 0 
9 1 


If F : S → T satisfies that F(x) ≠ F(y) for all x ≠ y then we say
that F is one-to-one (Definition 1.1, also known as an injective function
or simply an injection). If F satisfies that for every y ∈ T there is some
x ∈ S such that F(x) = y then we say that F is onto (also known as a
surjective function or simply a surjection). A function that is both
one-to-one and onto is known as a bijective function or simply a bijection.
A bijection from a set S to itself is also known as a permutation of S. If
F : S → T is a bijection then for every y ∈ T there is a unique x ∈ S
such that F(x) = y. We denote this value x by F^{-1}(y). Note that F^{-1}
is itself a bijection from T to S (can you see why?).
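For finite sets, these properties can be checked mechanically. In the sketch below a function is stored as a Python dict from inputs to outputs (our representation, chosen for illustration); the helper names `image`, `is_one_to_one`, `is_onto`, and `inverse` are hypothetical.

```python
# The function of Table 1.1, stored as a table (a dict) built from its rule.
S = range(10)
T = {0, 1}
F_table = {x: x % 2 for x in S}

def image(F):
    """The set of outputs that are mapped from some input."""
    return set(F.values())

def is_one_to_one(F):
    """True if no two distinct inputs share an output."""
    return len(set(F.values())) == len(F)

def is_onto(F, codomain):
    """True if every element of the codomain is hit by some input."""
    return image(F) == set(codomain)

def inverse(F):
    """Swap inputs and outputs; only meaningful when F is a bijection."""
    return {y: x for x, y in F.items()}

# A bijection from {0, 1, 2} to itself, i.e., a permutation.
P = {0: 2, 1: 0, 2: 1}

print(is_one_to_one(F_table), is_onto(F_table, T))
print(is_one_to_one(P), is_onto(P, {0, 1, 2}), inverse(P))
```

The table function is onto but not one-to-one, while the permutation P is both, so it has an inverse.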

Giving a bijection between two sets is often a good way to show
they have the same size. In fact, the standard mathematical definition
of the notion that “S and T have the same cardinality” is that there
exists a bijection f : S → T. Further, the cardinality of a set S is
defined to be n if there is a bijection from S to the set {0, ..., n - 1}.
As we will see later in this book, this is a definition that generalizes to
defining the cardinality of infinite sets.

2 For two natural numbers x and a, x mod a (shorthand
for “modulo”) denotes the remainder of x
when it is divided by a. That is, it is the number r in
{0, ..., a - 1} such that x = ak + r for some integer k.
We sometimes also use the notation x ≡ y (mod a)
to denote the assertion that x mod a is the same as y
mod a.


Partial functions: We will sometimes be interested in partial functions
from S to T. A partial function is allowed to be undefined on some
subset of S. That is, if F is a partial function from S to T, then for
every s ∈ S, either there is (as in the case of standard functions) an
element F(s) in T, or F(s) is undefined. For example, the partial
function F(x) = √x is only defined on non-negative real numbers. When
we want to distinguish between partial functions and standard (i.e.,
non-partial) functions, we will call the latter total functions. When we
say “function” without any qualifier then we mean a total function.
The notion of partial functions is a strict generalization of functions,
and so every function is a partial function, but not every partial
function is a function. (That is, for every non-empty S and T, the set of
partial functions from S to T is a proper superset of the set of total
functions from S to T.) When we want to emphasize that a function
f from A to B might not be total, we will write f : A →_p B. We can
think of a partial function F from S to T also as a total function from
S to T ∪ {⊥} where ⊥ is a special “failure symbol”. So, instead of
saying that F is undefined at x, we can say that F(x) = ⊥.
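The “failure symbol” view has a direct programming analogue. In the following sketch (ours, not from the text) Python's `None` plays the role of the failure symbol, and `partial_sqrt` is a hypothetical name for the square root viewed as a partial function on the reals.

```python
import math

BOTTOM = None  # stands in for the special failure symbol in this sketch

def partial_sqrt(x):
    """Square root as a partial function on the reals: undefined for x < 0."""
    if x < 0:
        return BOTTOM
    return math.sqrt(x)

print(partial_sqrt(9))
print(partial_sqrt(-1))
```

Returning a distinguished value instead of raising an error turns the partial function into a total function whose codomain has one extra element.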


Basic facts about functions: Verifying that you can prove the following
results is an excellent way to brush up on functions:


• If F : S → T and G : T → U are one-to-one functions, then their
composition H : S → U defined as H(s) = G(F(s)) is also one to
one.


• If F : S → T is one to one, then there exists an onto function
G : T → S such that G(F(s)) = s for every s ∈ S.


• If G : T → S is onto then there exists a one-to-one function F : S →
T such that G(F(s)) = s for every s ∈ S.


• If S and T are non-empty finite sets then the following conditions
are equivalent to one another: (a) |S| ≤ |T|, (b) there is a
one-to-one function F : S → T, and (c) there is an onto function
G : T → S. These equivalences are in fact true even for infinite S
and T. For infinite sets the condition (b) (or equivalently, (c)) is
the commonly accepted definition for |S| ≤ |T|.


Figure 1.4: We can represent finite functions as a
directed graph where we put an edge from x to
f(x). The onto condition corresponds to requiring
that every vertex in the codomain of the function
has in-degree at least one. The one-to-one condition
corresponds to requiring that every vertex in the
codomain of the function has in-degree at most one. In
the examples above F is an onto function, G is one to
one, and H is neither onto nor one to one.




Let us prove one of these facts as an example: 


Lemma 1.2 If S, T are non-empty sets and F : S → T is one to one, then
there exists an onto function G : T → S such that G(F(s)) = s for
every s ∈ S.


Proof. Choose some s_0 ∈ S. We will define the function G : T → S as
follows: for every t ∈ T, if there is some s ∈ S such that F(s) = t then
set G(t) = s (the choice of s is well defined since by the one-to-one
property of F, there cannot be two distinct s, s′ that both map to t).
Otherwise, set G(t) = s_0. Now for every s ∈ S, by the definition of G,
if t = F(s) then G(t) = G(F(s)) = s. Moreover, this also shows that
G is onto, since it means that for every s ∈ S there is some t, namely
t = F(s), such that G(t) = s.
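For finite sets, the construction in this proof can be carried out literally. The sketch below is one possible rendering, with functions stored as Python dicts; the name `left_inverse` and the example sets are our own assumptions.

```python
def left_inverse(F, S, T):
    """Given a one-to-one F : S -> T (as a dict), build G : T -> S with G(F(s)) = s."""
    s0 = next(iter(S))          # "choose some s_0 in S"
    G = {t: s0 for t in T}      # default value, used for every t outside the image of F
    for s in S:
        G[F[s]] = s             # well defined because F is one-to-one
    return G

S = {"a", "b", "c"}
T = {1, 2, 3, 4, 5}
F = {"a": 2, "b": 5, "c": 1}    # a one-to-one function from S to T
G = left_inverse(F, S, T)

print(all(G[F[s]] == s for s in S))   # G undoes F
print(set(G.values()) == S)           # G is onto S
```

The elements of T that F never hits (here 3 and 4) are all sent to the arbitrary default s_0, exactly as in the “Otherwise” case of the proof.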


1.4.4 Graphs 
Graphs are ubiquitous in Computer Science, and many other fields as 
well. They are used to model a variety of data types including social 
networks, scheduling constraints, road networks, deep neural nets, 
gene interactions, correlations between observations, and a great 
many more. Formal definitions of several kinds of graphs are given 
next, but if you have not seen graphs before in a course, I urge you to 
read up on them in one of the sources mentioned in Section 1.9. 
Graphs come in two basic flavors: undirected and directed.³


Definition 1.3 — Undirected graphs. An undirected graph G = (V, E) con- 
sists of a set V of vertices and a set E of edges. Every edge is a size 
two subset of V. We say that two vertices u, v ∈ V are neighbors if
the edge {u,v} is in E. 


Given this definition, we can define several other properties of
graphs and their vertices. We define the degree of u to be the number
of neighbors u has. A path in the graph is a tuple (u_0, ..., u_k) ∈ V^{k+1},
for some k > 0 such that u_{i+1} is a neighbor of u_i for every i ∈ [k]. A
simple path is a path (u_0, ..., u_{k-1}) where all the u_i’s are distinct. A cycle
is a path (u_0, ..., u_k) where u_0 = u_k. We say that two vertices u, v ∈ V
are connected if either u = v or there is a path (u_0, ..., u_k) where
u_0 = u and u_k = v. We say that the graph G is connected if every pair of
vertices in it is connected.

3 It is possible, and sometimes useful, to think of an
undirected graph as the special case of a directed
graph that has the special property that for every pair
u, v either both the edges (u, v) and (v, u) are present
or neither of them is. However, in many settings there
is a significant difference between undirected and
directed graphs, and so it’s typically best to think of
them as separate categories.

Figure 1.5: An example of an undirected and a
directed graph. The undirected graph has vertex set
{1, 2, 3, 4} and edge set {{1, 2}, {2, 3}, {2, 4}}. The
directed graph has vertex set {a, b, c} and the edge
set {(a, b), (b, c), (c, a), (a, c)}.
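These definitions translate almost word for word into code. The sketch below encodes the undirected graph of Figure 1.5, with each edge stored as a two-element `frozenset` so that it is literally a size-two subset of V. The helper names are ours, and the search in `connected` is one standard way to test connectivity, not a procedure taken from the text.

```python
V = {1, 2, 3, 4}
E = {frozenset({1, 2}), frozenset({2, 3}), frozenset({2, 4})}

def neighbors(u):
    """All v such that the edge {u, v} is in E."""
    return {v for v in V if frozenset({u, v}) in E}

def degree(u):
    return len(neighbors(u))

def is_path(p):
    """p = (u_0, ..., u_k) with k > 0 and u_{i+1} a neighbor of u_i for every i."""
    return len(p) > 1 and all(p[i + 1] in neighbors(p[i]) for i in range(len(p) - 1))

def connected(u, v):
    """Search outward from u; u and v are connected if u = v or v is reached."""
    seen, frontier = {u}, [u]
    while frontier:
        w = frontier.pop()
        for x in neighbors(w):
            if x not in seen:
                seen.add(x)
                frontier.append(x)
    return v in seen

print({u: degree(u) for u in sorted(V)})
print(is_path((1, 2, 3)), is_path((1, 3)))
print(all(connected(u, v) for u in V for v in V))
```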

Here are some basic facts about undirected graphs. We give some 
informal arguments below, but leave the full proofs as exercises (the 
proofs can be found in many of the resources listed in Section 1.9). 


Lemma 1.4 In any undirected graph G = (V, E), the sum of the degrees 
of all vertices is equal to twice the number of edges. 


Lemma 1.4 can be shown by seeing that every edge {u,v} con- 
tributes twice to the sum of the degrees (once for u and the second 
time for v). 


Lemma 1.5 The connectivity relation is transitive, in the sense that if u is 
connected to v, and v is connected to w, then u is connected to w. 


Lemma 1.5 can be shown by simply attaching a path of the form
(u, u_1, u_2, ..., u_{k-1}, v) to a path of the form (v, u′_1, ..., u′_{k′-1}, w) to obtain
the path (u, u_1, ..., u_{k-1}, v, u′_1, ..., u′_{k′-1}, w) that connects u to w.


Lemma 1.6 For every undirected graph G = (V, E) and connected pair 
u,v, the shortest path from u to v is simple. In particular, for every 
connected pair there exists a simple path that connects them. 


Lemma 1.6 can be shown by “shortcutting” any non-simple path
from u to v where the same vertex w appears twice to remove it (see
Fig. 1.6). It is a good exercise to transform this intuitive reasoning
into a formal proof:




Solved Exercise 1.1 — Connected vertices have simple paths. Prove Lemma 1.6 


Solution:
The proof follows the idea illustrated in Fig. 1.6. One complication
is that there can be more than one vertex that is visited twice
by a path, and so “shortcutting” might not necessarily result in a
simple path; we deal with this by looking at a shortest path between
u and v. Details follow.

Let G = (V, E) be a graph and u and v in V be two connected
vertices in G. We will prove that there is a simple path between u
and v. Let k be the shortest length of a path between u and v and
let P = (u_0, u_1, u_2, ..., u_{k-1}, u_k) be a k-length path from u to v
(there can be more than one such path: if so we just choose one of
them). (That is, u_0 = u, u_k = v, and (u_ℓ, u_{ℓ+1}) ∈ E for all ℓ ∈ [k].)
We claim that P is simple. Indeed, suppose otherwise that there is
some vertex w that occurs twice in the path: w = u_i and w = u_j for
some i < j. Then we can “shortcut” the path P by considering the
path P′ = (u_0, u_1, ..., u_{i-1}, w, u_{j+1}, ..., u_k) obtained by taking the
first i vertices of P (from u_0 = u to the first occurrence of w) and
the last k - j ones (from the vertex u_{j+1} following the second
occurrence of w to u_k = v). The path P′ is a valid path between u and
v since every consecutive pair of vertices in it is connected by an
edge (in particular, since w = u_i = u_j, both (u_{i-1}, w) and (w, u_{j+1})
are edges in E), but since the length of P′ is k - (j - i) < k, this
contradicts the minimality of P.

Figure 1.6: If there is a path from u to v in a graph
that passes twice through a vertex w then we can
“shortcut” it by removing the loop from w to itself to
find a path from u to v that only passes once through
w.
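The proof above is non-constructive in that it starts from a shortest path. The informal “shortcutting” idea can also be run as a procedure, by cutting out loops until none remain. The sketch below is our own variation under that assumption (the name `shortcut` is hypothetical): for each vertex it keeps the first occurrence and jumps to just after the last one.

```python
def shortcut(path):
    """Repeatedly cut out the loop between two occurrences of a repeated vertex."""
    path = list(path)
    i = 0
    while i < len(path):
        w = path[i]
        # Index of the last occurrence of w in the current path.
        j = len(path) - 1 - path[::-1].index(w)
        if j > i:
            path = path[:i + 1] + path[j + 1:]   # keep w once, drop the loop from w to itself
        i += 1
    return path

# A path from 1 to 5 that passes twice through 2 and twice through 3.
P = [1, 2, 3, 2, 4, 3, 5]
Q = shortcut(P)
print(Q)
```

Every consecutive pair in the result is also a consecutive pair in the original path, so the result is still a path between the same endpoints, and it visits no vertex twice.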


The concepts of degrees and connectivity extend naturally to directed
graphs, defined as follows.


Definition 1.8 — Directed graphs. A directed graph G = (V, E) consists
of a set V and a set E ⊆ V × V of ordered pairs of V. We sometimes
denote the edge (u, v) also as u → v. If the edge u → v is present
in the graph then we say that v is an out-neighbor of u and u is an
in-neighbor of v.


A directed graph might contain both u → v and v → u in which
case u will be both an in-neighbor and an out-neighbor of v and vice
versa. The in-degree of u is the number of in-neighbors it has, and the
out-degree of v is the number of out-neighbors it has. A path in the
graph is a tuple (u_0, ..., u_k) ∈ V^{k+1}, for some k > 0 such that u_{i+1} is an
out-neighbor of u_i for every i ∈ [k]. As in the undirected case, a simple
path is a path (u_0, ..., u_{k-1}) where all the u_i’s are distinct and a cycle
is a path (u_0, ..., u_k) where u_0 = u_k. One type of directed graphs we
often care about is directed acyclic graphs or DAGs, which, as their name
implies, are directed graphs without any cycles:


Definition 1.9 — Directed Acyclic Graphs. We say that G = (V, E) is a
directed acyclic graph (DAG) if it is a directed graph and there does
not exist a list of vertices u_0, u_1, ..., u_k ∈ V such that u_0 = u_k and
for every i ∈ [k], the edge u_i → u_{i+1} is in E.
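As an illustration, the sketch below encodes the directed graph of Figure 1.5 and tests whether it is a DAG. The test repeatedly removes a vertex with no incoming edges, which is the idea behind Kahn's topological sorting algorithm; this is our choice of method, not one described in the text, and the helper names are hypothetical.

```python
def out_neighbors(E, u):
    return {v for (x, v) in E if x == u}

def in_neighbors(E, v):
    return {u for (u, y) in E if y == v}

def is_dag(V, E):
    """Repeatedly remove a vertex with in-degree zero; a DAG can be emptied this way."""
    remaining_V, remaining_E = set(V), set(E)
    while remaining_V:
        sources = [u for u in remaining_V if not in_neighbors(remaining_E, u)]
        if not sources:
            return False    # every remaining vertex has an incoming edge, so there is a cycle
        u = sources[0]
        remaining_V.remove(u)
        remaining_E = {(x, y) for (x, y) in remaining_E if x != u}
    return True

V = {"a", "b", "c"}
E = {("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")}

print(out_neighbors(E, "a"), in_neighbors(E, "a"))
print(is_dag(V, E))                      # contains the cycle a -> b -> c -> a
print(is_dag(V, E - {("c", "a")}))       # removing c -> a leaves no cycle
```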


The lemmas we mentioned above have analogs for directed graphs. 
We again leave the proofs (which are essentially identical to their 
undirected analogs) as exercises. 


Lemma 1.10 In any directed graph G = (V, E), the sum of the in- 
degrees is equal to the sum of the out-degrees, which is equal to the 
number of edges. 


Lemma 1.11 In any directed graph G, if there is a path from u to v and a
path from v to w, then there is a path from u to w.


Lemma 1.12 For every directed graph G = (V, E) and a pair u, v such 
that there is a path from u to v, the shortest path from u to v is simple. 


1.4.5 Logic operators and quantifiers 
If P and Q are some statements that can be true or false, then P AND
Q (denoted as P ∧ Q) is a statement that is true if and only if both P
and Q are true, and P OR Q (denoted as P ∨ Q) is a statement that is
true if and only if either P or Q is true. The negation of P, denoted as
¬P or P̄, is true if and only if P is false.

Suppose that P(x) is a statement that depends on some parameter x
(also sometimes known as an unbound variable) in the sense that for
every instantiation of x with a value from some set S, P(x) is either
true or false. For example, x > 7 is a statement that is not a priori
true or false, but becomes true or false whenever we instantiate x with
some real number. We denote by ∀_{x∈S} P(x) the statement that is true
if and only if P(x) is true for every x ∈ S.⁴ We denote by ∃_{x∈S} P(x) the
statement that is true if and only if there exists some x ∈ S such that
P(x) is true.

For example, the following is a formalization of the true statement 
that there exists a natural number n larger than 100 that is not divisi- 
ble by 3: 


∃_{n∈N} (n > 100) ∧ (∀_{k∈N} k + k + k ≠ n) .
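Python's `any` and `all` behave like the existential and universal quantifiers over finite collections. The sketch below checks the statement above over a bounded range; the cutoff `BOUND` is our assumption, needed because a program cannot enumerate all of N, and restricting k to at most n loses nothing since k + k + k = n forces k ≤ n.

```python
BOUND = 1000  # assumption: a finite search range standing in for the natural numbers

# "There exists n > 100 such that for every k, k + k + k != n", searched over n < BOUND.
statement_holds = any(
    n > 100 and all(k + k + k != n for k in range(n + 1))
    for n in range(BOUND)
)

# The witnesses found in the search range.
witnesses = [
    n for n in range(BOUND)
    if n > 100 and all(k + k + k != n for k in range(n + 1))
]

print(statement_holds, witnesses[:3])
```

Finding a single witness is enough to establish an existential statement, whereas a bounded search could never by itself establish a universal statement over all of N.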


“For sufficiently large n.” One expression that we will see come up
time and again in this book is the claim that some statement P(n) is
true “for sufficiently large n”. What this means is that there exists an
integer N_0 such that P(n) is true for every n > N_0. We can formalize
this as ∃_{N_0∈N} ∀_{n>N_0} P(n).


1.4.6 Quantifiers for summations and products 

The following shorthands for summing up or taking products of
several numbers are often convenient. If S = {s_0, ..., s_{n-1}} is a finite set
and f : S → R is a function, then we write ∑_{x∈S} f(x) as shorthand for


f(s_0) + f(s_1) + f(s_2) + ⋯ + f(s_{n-1}) ,

and ∏_{x∈S} f(x) as shorthand for


f(s_0) · f(s_1) · f(s_2) · ⋯ · f(s_{n-1}) .


For example, the sum of the squares of all numbers from 1 to 100
can be written as


∑_{i∈{1,...,100}} i^2 .    (1.1)

Since summing up over intervals of integers is so common, there
is a special notation for it. For every two integers a ≤ b, ∑_{i=a}^{b} f(i)
denotes ∑_{i∈S} f(i) where S = {x ∈ Z : a ≤ x ≤ b}. Hence, we can
write the sum (1.1) as

∑_{i=1}^{100} i^2 .
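These shorthands correspond to `sum` and `math.prod` in Python (the latter is in the standard library from version 3.8). The following sketch evaluates the sum of squares from 1 to 100, and a small sum and product over a set; the set S and the function f are made up for illustration.

```python
import math

# The sum of i^2 for i in {1, ..., 100}.
sum_of_squares = sum(i ** 2 for i in range(1, 101))

S = {1, 2, 3, 4}

def f(x):
    return x + 1

total = sum(f(x) for x in S)            # f(1) + f(2) + f(3) + f(4)
product = math.prod(f(x) for x in S)    # f(1) * f(2) * f(3) * f(4)

print(sum_of_squares, total, product)
```

Note that `range(1, 101)` includes 1 and excludes 101, so it matches the interval of integers from 1 to 100.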


1.4.7 Parsing formulas: bound and free variables 

In mathematics, as in coding, we often have symbolic “variables” or
“parameters”. It is important to be able to understand, given some
formula, whether a given variable is bound or free in this formula. For
example, in the following statement n is free but a and b are bound by
the ∃ quantifier:


∃_{a,b∈N} (a ≠ 1) ∧ (a ≠ n) ∧ (n = a × b)    (1.2)


Since n is free, it can be set to any value, and the truth of the
statement (1.2) depends on the value of n. For example, if n = 8 then (1.2)
is true, but for n = 11 it is false. (Can you see why?)

4 In this book, we place the variable bound by a
quantifier in a subscript and so write ∀_{x∈S} P(x). Many
other texts do not use this subscript notation and so
will write the same statement as ∀x ∈ S, P(x).

The same issue appears when parsing code. For example, in the 
following snippet from the C programming language 


for (int i=0 ; i<n ; i=i+1) {
    printf("x");
}


the variable i is bound within the for block but the variable n is 
free. 

The main property of bound variables is that we can rename them 
(as long as the new name doesn’t conflict with another used variable) 
without changing the meaning of the statement. Thus for example the 
statement 


∃_{x,y∈N} (x ≠ 1) ∧ (x ≠ n) ∧ (n = x × y)    (1.3)
is equivalent to (1.2) in the sense that it is true for exactly the same 
set of n’s. 
Similarly, the code 


for (int j=0 ; j<n ; j=j+1) {
    printf("x");
}


produces the same result as the code above that used i instead of j. 
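The same point can be made for the quantified statements themselves. In the sketch below (ours, with hypothetical function names) the parameter n is free, while the variables of the generator expression are bound, so renaming them does not change the function. We assume n ≥ 1, so that any factors of n lie in {0, ..., n} and the search range is large enough.

```python
def has_nontrivial_factorization(n):
    """Statement (1.2): n is free; a and b are bound by the generator expression."""
    return any(a != 1 and a != n and n == a * b
               for a in range(n + 1) for b in range(n + 1))

def has_nontrivial_factorization_renamed(n):
    """Statement (1.3): the same statement with the bound variables renamed to x and y."""
    return any(x != 1 and x != n and n == x * y
               for x in range(n + 1) for y in range(n + 1))

print(has_nontrivial_factorization(8), has_nontrivial_factorization(11))
print(all(has_nontrivial_factorization(n) == has_nontrivial_factorization_renamed(n)
          for n in range(1, 50)))
```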




1.4.8 Asymptotics and Big-O notation 
“log log log n has been proved to go to infinity, but has never been observed to 
do so.”, Anonymous, quoted by Carl Pomerance (2000) 


It is often very cumbersome to describe precisely quantities such
as running time and is also not needed, since we are typically mostly
interested in the “higher order terms”. That is, we want to understand
the scaling behavior of the quantity as the input variable grows. For
example, as far as running time goes, the difference between an
n^5-time algorithm and an n^2-time one is much more significant than the
difference between a 100n^2 + 10n time algorithm and a 10n^2 time
algorithm. For this purpose, O-notation is extremely useful as a way
to “declutter” our text and focus our attention on what really matters.
For example, using O-notation, we can say that both 100n^2 + 10n
and 10n^2 are simply Θ(n^2) (which informally means “the same up to
constant factors”), while n^2 = o(n^5) (which informally means that n^2
is “much smaller than” n^5).

Generally (though still informally), if F, G are two functions
mapping natural numbers to non-negative reals, then “F = O(G)” means
that F(n) ≤ G(n) if we don’t care about constant factors, while
“F = o(G)” means that F is much smaller than G, in the sense that no
matter by what constant factor we multiply F, if we take n to be large
enough then G will be bigger (for this reason, sometimes F = o(G)
is written as F ≪ G). We will write F = Θ(G) if F = O(G) and
G = O(F), which one can think of as saying that F is the same as G if
we don’t care about constant factors. More formally, we define Big-O
notation as follows:


Definition 1.15 — Big-O notation. Let R_+ = {x ∈ R | x > 0} be the set
of positive real numbers. For two functions F, G : N → R_+, we say
that F = O(G) if there exist numbers a, N_0 ∈ N such that F(n) ≤
a · G(n) for every n > N_0. We say that F = Θ(G) if F = O(G) and
G = O(F). We say that F = Ω(G) if G = O(F).

We say that F = o(G) if for every ε > 0 there is some N_0 such
that F(n) < εG(n) for every n > N_0. We say that F = ω(G) if
G = o(F).
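To get a feel for the roles of the constants a and N_0, one can test a proposed pair numerically. The sketch below does this for F(n) = 100n^2 + 10n and G(n) = n^2; the helper name `witnesses_big_o`, the particular constants, and the cutoff `upto` are our assumptions. A finite check like this can refute a bad choice of constants, but it is not a proof that a good choice works for all n.

```python
def F(n):
    return 100 * n**2 + 10 * n

def G(n):
    return n**2

def witnesses_big_o(F, G, a, N0, upto):
    """Check F(n) <= a * G(n) for every N0 < n <= upto (a finite check, not a proof)."""
    return all(F(n) <= a * G(n) for n in range(N0 + 1, upto + 1))

print(witnesses_big_o(F, G, a=110, N0=0, upto=10_000))   # consistent with F = O(G)
print(witnesses_big_o(F, G, a=100, N0=0, upto=100))      # a = 100 is too small
print(witnesses_big_o(G, F, a=1, N0=0, upto=10_000))     # consistent with G = O(F)
```

Here the choice a = 110 can be justified by hand: 100n^2 + 10n ≤ 110n^2 is equivalent to 10n ≤ 10n^2, which holds for every n ≥ 1.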


It’s often convenient to use “anonymous functions” in the context of
O-notation. For example, when we write a statement such as F(n) =
O(n^3), we mean that F = O(G) where G is the function defined by
G(n) = n^3. Chapter 7 in Jim Aspnes’ notes on discrete math provides
a good summary of O notation; see also this tutorial for a gentler and
more programmer-oriented introduction.

O is not equality. Using the equality sign for O-notation is extremely
common, but is somewhat of a misnomer, since a statement such as
F = O(G) really means that F is in the set {G′ : ∃_{N,c} s.t. ∀_{n>N} G′(n) ≤
cG(n)}. If anything, it makes more sense to use inequalities and write
F ≤ O(G) and F ≥ Ω(G), reserving equality for F = Θ(G), and
so we will sometimes use this notation too, but since the equality
notation is quite firmly entrenched we often stick to it as well. (Some
texts write F ∈ O(G) instead of F = O(G), but we will not use this
notation.) Despite the misleading equality sign, you should remember
that a statement such as F = O(G) means that F is “at most” G in
some rough sense when we ignore constants, and a statement such as
F = Ω(G) means that F is “at least” G in the same rough sense.


1.4.9 Some “rules of thumb” for Big-O notation 
There are some simple heuristics that can help when trying to
compare two functions F and G:


• Multiplicative constants don’t matter in O-notation, and so if
F(n) = O(G(n)) then 100F(n) = O(G(n)).


• When adding two functions, we only care about the larger one. For
example, for the purpose of O-notation, n^3 + 100n^2 is the same as
n^3, and in general in any polynomial, we only care about the larger
exponent.


• For every two constants a, b > 0, n^a = O(n^b) if and only if a ≤ b,
and n^a = o(n^b) if and only if a < b. For example, combining the two
observations above, 100n^2 + 10n + 100 = o(n^3).


• Polynomial is always smaller than exponential: n^a = o(2^{n^ε}) for
every two constants a > 0 and ε > 0 even if ε is much smaller than
a. For example, 100n^100 = o(2^{√n}).


• Similarly, logarithmic is always smaller than polynomial: (log n)^a
(which we write as log^a n) is o(n^ε) for every two constants a, ε > 0.
For example, combining the observations above, 100n^2 log^100 n =
o(n^3).


Figure 1.7: If F(n) = o(G(n)) then for sufficiently
large n, F(n) will be smaller than G(n). For example,
if Algorithm A runs in time 1000 · n + 10^6 and
Algorithm B runs in time 0.01 · n^2 then even though
B might be more efficient for smaller inputs, when
the inputs get sufficiently large, A will run much faster
than B.
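The phrase “for sufficiently large n” can be made tangible by locating where a slower-growing running time overtakes a faster-growing one. The sketch below compares two hypothetical running times, 1000 · n + 10^6 for an Algorithm A and 0.01 · n^2 for an Algorithm B; these constants are our reading of the example in Figure 1.7 and should be treated as assumptions, as should the function names.

```python
def time_A(n):
    return 1000 * n + 10**6     # linear, with large constants

def time_B(n):
    return 0.01 * n**2          # quadratic, with a small constant

def first_n_where_A_wins(limit):
    """Smallest n <= limit with time_A(n) < time_B(n), or None if there is none."""
    for n in range(1, limit + 1):
        if time_A(n) < time_B(n):
            return n
    return None

print(time_A(10) > time_B(10))          # B is faster on small inputs
print(first_n_where_A_wins(200_000))    # from here on A is faster
print(time_B(10**6) / time_A(10**6))    # and the gap keeps growing
```

The large constants only delay the crossover; they do not prevent it, which is why they are dropped in O-notation.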




1.5 PROOFS 


Many people think of mathematical proofs as a sequence of logical 
deductions that starts from some axioms and ultimately arrives at 
a conclusion. In fact, some dictionaries define proofs that way. This 
is not entirely wrong, but at its essence, a mathematical proof of a 
statement X is simply an argument that convinces the reader that X is 
true beyond a shadow of a doubt. 

To produce such a proof you need to: 


1. Understand precisely what X means. 
2. Convince yourself that X is true. 


3. Write your reasoning down in plain, precise and concise English 
(using formulas or notation only when they help clarity). 
In many cases, the first part is the most important one. Understanding
what a statement means is oftentimes more than halfway towards
understanding why it is true. In the third part, to convince the reader 
beyond a shadow of a doubt, we will often want to break down the 
reasoning to “basic steps”, where each basic step is simple enough 
to be “self-evident”. The combination of all steps yields the desired 
statement. 


1.5.1 Proofs and programs 

There is a great deal of similarity between the process of writing proofs 
and that of writing programs, and both require a similar set of skills. 
Writing a program involves: 


1. Understanding what is the task we want the program to achieve. 


2. Convincing yourself that the task can be achieved by a computer, 
perhaps by planning on a whiteboard or notepad how you will 
break it up into simpler tasks. 


3. Converting this plan into code that a compiler or interpreter can 
understand, by breaking up each task into a sequence of the basic 
operations of some programming language. 


In programs as in proofs, step 1 is often the most important one. 
A key difference is that the reader for proofs is a human being and 
the reader for programs is a computer. (This difference is eroding 
with time as more proofs are being written in a machine verifiable form; 
moreover, to ensure correctness and maintainability of programs, it 
is important that they can be read and understood by humans.) Thus 
our emphasis is on readability and having a clear logical flow for our 
proof (which is not a bad idea for programs as well). When writing a 
proof, you should think of your audience as an intelligent but highly
skeptical and somewhat petty reader who will “call foul” at every step
that is not well justified.


1.5.2 Proof writing style 

A mathematical proof is a piece of writing, but it is a specific genre
of writing with certain conventions and preferred styles. As in any
writing, practice makes perfect, and it is also important to revise your
drafts for clarity.

In a proof for the statement X, all the text between the words 
“Proof:” and “QED” should be focused on establishing that X is true. 
Digressions, examples, or ruminations should be kept outside these 
two words, so they do not confuse the reader. The proof should have 
a clear logical flow in the sense that every sentence or equation in it 
should have some purpose and it should be crystal-clear to the reader 
what this purpose is. When you write a proof, for every equation or
sentence you include, ask yourself: 


1. Is this sentence or equation stating that some statement is true? 


2. If so, does this statement follow from the previous steps, or are we
going to establish it in the next step?


3. What is the role of this sentence or equation? Is it one step towards 
proving the original statement, or is it a step towards proving some 
intermediate claim that you have stated before? 


4. Finally, would the answers to questions 1-3 be clear to the reader? 
If not, then you should reorder, rephrase, or add explanations. 


Some helpful resources on mathematical writing include a handout
by Lee, a handout by Hutchings, as well as several of the excellent
handouts in Stanford’s CS 103 class.


1.5.3 Patterns in proofs 
“If it was so, it might be; and if it were so, it would be; but as it isn’t, it ain’t.
That’s logic.”, Lewis Carroll, Through the Looking-Glass.


Just like in programming, there are several common patterns of 
proofs that occur time and again. Here are some examples: 


Proofs by contradiction: One way to prove that X is true is to show
that if X were false, this would result in a contradiction. Such proofs
often start with a sentence such as “Suppose, towards a contradiction,
that X is false” and end with deriving some contradiction (such as a
violation of one of the assumptions in the theorem statement). Here is
an example:


Lemma 1.17 There are no natural numbers $a, b$ such that $\sqrt{2} = \frac{a}{b}$.


Proof. Suppose, towards a contradiction, that this is false, and so let
$a \in \mathbb{N}$ be the smallest number such that there exists some $b \in \mathbb{N}$
satisfying $\sqrt{2} = \frac{a}{b}$. Squaring this equation we get that $2 = a^2/b^2$ or
$a^2 = 2b^2$ $(*)$. But this means that $a^2$ is even, and since the product of
two odd numbers is odd, it means that $a$ is even as well, or in other
words, $a = 2a'$ for some $a' \in \mathbb{N}$. Yet plugging this into $(*)$ shows that
$4a'^2 = 2b^2$ which means $b^2 = 2a'^2$ is an even number as well. By the
same considerations as above we get that $b$ is even and hence $a/2$ and
$b/2$ are two natural numbers satisfying $\frac{a/2}{b/2} = \sqrt{2}$, contradicting the
minimality of $a$.
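
As a sanity check of the two facts used in this proof, here is a minimal Python sketch. It searches a finite range for positive integers with $a^2 = 2b^2$ and confirms that an even square always comes from an even number. The function names and the search limits are illustrative assumptions, and a finite search is of course not a substitute for the proof.

```python
def has_rational_sqrt2(limit: int) -> bool:
    """Return True if some 1 <= a, b <= limit satisfy a*a == 2*b*b."""
    squares = {b * b for b in range(1, limit + 1)}
    for a in range(1, limit + 1):
        # a*a == 2*b*b forces a*a to be even and a*a/2 to be a perfect square.
        if (a * a) % 2 == 0 and (a * a) // 2 in squares:
            return True
    return False


def even_square_implies_even(limit: int) -> bool:
    """Check, for 0 <= a <= limit, that a*a even implies a even."""
    return all(a % 2 == 0 for a in range(limit + 1) if (a * a) % 2 == 0)


print(has_rational_sqrt2(2000))        # no counterexample in this range
print(even_square_implies_even(1000))  # the parity step used in the proof
```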
Proofs of a universal statement: Often we want to prove a statement X of
the form “Every object of type O has property P.” Such proofs often 
start with a sentence such as “Let o be an object of type O” and end by 
showing that o has the property P. Here is a simple example: 


Lemma 1.18 For every natural number $n \in \mathbb{N}$, either $n$ or $n + 1$ is even.


Proof. Let $n \in \mathbb{N}$ be some number. If $n/2$ is a whole number then
we are done, since then $n = 2(n/2)$ and hence it is even. Otherwise,
$n/2 + 1/2$ is a whole number, and hence $2(n/2 + 1/2) = n + 1$ is even.
∎


Proofs of an implication: Another common case is that the statement X 
has the form “A implies B”. Such proofs often start with a sentence 
such as “Assume that A is true” and end with a derivation of B from 
A. Here is a simple example: 


Lemma 1.19 If $b^2 \geq 4ac$ then there is a solution to the quadratic equation
$ax^2 + bx + c = 0$.


Proof. Suppose that $b^2 \geq 4ac$. Then $d = b^2 - 4ac$ is a non-negative
number and hence it has a square root $s$. Thus $x = (-b + s)/(2a)$
satisfies

$$ax^2 + bx + c = a(-b + s)^2/(4a^2) + b(-b + s)/(2a) + c = (b^2 - 2bs + s^2)/(4a) + (-b^2 + bs)/(2a) + c. \qquad (1.4)$$

Rearranging the terms of (1.4) we get

$$s^2/(4a) + c - b^2/(4a) = (b^2 - 4ac)/(4a) + c - b^2/(4a) = 0.$$
∎
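
The construction in this proof can be tried numerically. The following is a minimal sketch, assuming real coefficients with a nonzero leading coefficient (an assumption the formula for $x$ needs); the name `quadratic_root` is hypothetical and floating-point arithmetic only gives an approximate check.

```python
import math


def quadratic_root(a: float, b: float, c: float) -> float:
    """Return x = (-b + s) / (2a), where s is a square root of b^2 - 4ac."""
    if a == 0:
        raise ValueError("the formula divides by 2a, so a must be nonzero")
    d = b * b - 4 * a * c
    if d < 0:
        raise ValueError("the lemma assumes b^2 >= 4ac")
    s = math.sqrt(d)  # exists because d is non-negative
    return (-b + s) / (2 * a)


x = quadratic_root(1, -3, 2)   # x^2 - 3x + 2 = 0
print(x, 1 * x * x - 3 * x + 2)
```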


Proofs of equivalence: If a statement has the form “A if and only if
B” (often shortened as “A iff B”) then we need to prove both that A
implies B and that B implies A. We call the implication that A implies
B the “only if” direction, and the implication that B implies A the “if”
direction.


Proofs by combining intermediate claims: When a proof is more complex,
it is often helpful to break it apart into several steps. That is, to prove
the statement $X$, we might first prove statements $X_1$, $X_2$, and $X_3$ and
then prove that $X_1 \wedge X_2 \wedge X_3$ implies $X$. (Recall that $\wedge$ denotes the
logical AND operator.)


Proofs by case distinction: This is a special case of the above, where to
prove a statement $X$ we split into several cases $C_1, \ldots, C_k$, and prove
that (a) the cases are exhaustive, in the sense that one of the cases $C_i$
must happen and (b) each one of the cases $C_i$, taken one by one,
implies the result $X$ that we are after.
Proofs by induction: We discuss induction and give an example in
Section 1.6.1 below. We can think of such proofs as a variant of the
above, where we have an unbounded number of intermediate claims
$X_0, X_1, X_2, \ldots, X_k$, and we prove that $X_0$ is true, as well as that $X_0$
implies $X_1$, and that $X_0 \wedge X_1$ implies $X_2$, and so on and so forth. The
website for CMU course 15-251 contains a useful handout on potential
pitfalls when making proofs by induction.


“Without loss of generality (w.l.o.g.)”: This term can be initially quite confusing.
It is essentially a way to simplify proofs by case distinctions.
The idea is that if Case 1 is equal to Case 2 up to a change of variables 
or a similar transformation, then the proof of Case 1 will also imply 
the proof of Case 2. It is always a statement that should be viewed 
with suspicion. Whenever you see it in a proof, ask yourself if you 
understand why the assumption made is truly without loss of gen- 
erality, and when you use it, try to see if the use is indeed justified. 
When writing a proof, sometimes it might be easiest to simply repeat 
the proof of the second case (adding a remark that the proof is very 
similar to the first one). 




1.6 EXTENDED EXAMPLE: TOPOLOGICAL SORTING 


In this section we will prove the following: every directed acyclic
graph (DAG, see Definition 1.9) can be arranged in layers so that for
all directed edges $u \to v$, the layer of $v$ is larger than the layer of $u$.
This result is known as topological sorting and is used in many applications,
including task scheduling, build systems, software package
management, spreadsheet cell calculations, and many others (see
Fig. 1.8). In fact, we will also use it ourselves later on in this book.




We start with the following definition. A layering of a directed 
graph is a way to assign for every vertex v a natural number 
(corresponding to its layer), such that v’s in-neighbors are in 
lower-numbered layers than v, and v’s out-neighbors are in 
higher-numbered layers. The formal definition is as follows: 


Definition 1.21 — Layering of a DAG. Let $G = (V, E)$ be a directed graph.
A layering of $G$ is a function $f : V \to \mathbb{N}$ such that for every edge
$u \to v$ of $G$, $f(u) < f(v)$.


In this section we prove that a directed graph is acyclic if and only if 
it has a valid layering. 


Theorem 1.22 — Topological Sort. Let G be a directed graph. Then G is 
acyclic if and only if there exists a layering f of G. 


To prove such a theorem, we need to first understand what it 
means. Since it is an “if and only if” statement, Theorem 1.22 corre- 
sponds to two statements: 


Lemma 1.23 For every directed graph G, if G is acyclic then it has a 
layering. 

Lemma 1.24 For every directed graph G, if G has a layering, then it is 
acyclic. 
Figure 1.8: An example of topological sorting. We consider
a directed graph corresponding to a prerequisite
graph of the courses in some Computer Science program.
The edge $u \to v$ means that the course $u$ is a
prerequisite for the course $v$. A layering or “topological
sorting” of this graph is the same as mapping the
courses to semesters so that if we decide to take the
course $v$ in semester $f(v)$, then we have already taken
all the prerequisites for $v$ (i.e., its in-neighbors) in
prior semesters.

To prove Theorem 1.22 we need to prove both Lemma 1.23 and
Lemma 1.24. Lemma 1.24 is actually not that hard to prove. Intuitively, 
if G contains a cycle, then it cannot be the case that all edges on the 
cycle increase in layer number, since if we travel along the cycle at 
some point we must come back to the place we started from. The 
formal proof is as follows: 


Proof. Let $G = (V, E)$ be a directed graph and let $f : V \to \mathbb{N}$ be a
layering of $G$ as per Definition 1.21. Suppose, towards a contradiction,
that $G$ is not acyclic, and hence there exists some cycle $u_0, u_1, \ldots, u_k$
such that $u_0 = u_k$ and for every $i \in [k]$ the edge $u_i \to u_{i+1}$ is present in
$G$. Since $f$ is a layering, for every $i \in [k]$, $f(u_i) < f(u_{i+1})$ which means
that

$$f(u_0) < f(u_1) < \cdots < f(u_k)$$

but this is a contradiction since $u_0 = u_k$ and hence $f(u_0) = f(u_k)$.
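
To make the statement of Lemma 1.24 concrete, here is a minimal Python sketch of the layering condition from Definition 1.21. The representation of a graph as a list of `(u, v)` pairs and the name `is_layering` are assumptions made for illustration. On a small cycle, trying every assignment of layers from {0, 1, 2} finds no layering, in line with the lemma (though this only examines a finite set of candidates).

```python
from itertools import product


def is_layering(edges, f):
    """Check that f[u] < f[v] for every directed edge (u, v)."""
    return all(f[u] < f[v] for (u, v) in edges)


dag_edges = [("a", "b"), ("b", "c"), ("a", "c")]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "a")]

print(is_layering(dag_edges, {"a": 0, "b": 1, "c": 2}))

# Try all assignments of layers 0..2 to the three vertices of the cycle.
cycle_has_layering = any(
    is_layering(cycle_edges, dict(zip("abc", layers)))
    for layers in product(range(3), repeat=3)
)
print(cycle_has_layering)
```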


Lemma 1.23 corresponds to the more difficult (and useful) direction.
To prove it, we need to show how, given an arbitrary DAG G, we
can come up with a layering of the vertices of G so that all edges “go
up”.


1.6.1 Mathematical induction 
There are several ways to prove Lemma 1.23. One approach is
to start by proving it for small graphs, such as graphs with 1, 2 or 3
vertices (see Fig. 1.9), for which we can check all the cases, and then
try to extend the proof to larger graphs. The technical term for this
proof approach is proof by induction.

Induction is simply an application of the self-evident Modus Ponens 
rule that says that if 


(a) P is true 
and 


(b) P implies Q 
then Q is true. 




Figure 1.9: Some examples of DAGs of one, two and 
three vertices, and valid ways to assign layers to the 
vertices. 


In the setting of proofs by induction we typically have a statement 
Q(k) that is parameterized by some integer k, and we prove that (a) 
Q(0) is true, and (b) For every k > 0, if Q(0),..., Q(k — 1) are all true 
then Q(k) is true. (Usually proving (b) is the hard part, though there 
are examples where the “base case” (a) is quite subtle.) By applying 
Modus Ponens, we can deduce from (a) and (b) that Q(1) is true. 
Once we have done so, since we now know that both Q(0) and Q(1) are true,
we can use this and (b) to deduce (again using Modus Ponens)
that Q(2) is true. We can repeat the same reasoning again and again
to obtain that Q(k) is true for every k. The statement (a) is called the
“base case”, while (b) is called the “inductive step”. The assumption
in (b) that Q(i) holds for i < k is called the “inductive hypothesis”.
(The form of induction described here is sometimes called “strong 
induction” as opposed to “weak induction” where we replace (b) 
by the statement (b’) that if Q(k — 1) is true then Q(k) is true; weak 
induction can be thought of as the special case of strong induction 
where we don’t use the assumption that Q(0), ... ,Q(k — 2) are true.) 




1.6.2 Proving the result by induction 

There are several ways to prove Lemma 1.23 by induction. We will 
use induction on the number n of vertices, and so we will define the 
statement Q(n) as follows: 


Q(n) is “For every DAG G = (V, E) with n vertices, there is a layering of G.” 
The statement for Q(0) (where the graph contains no vertices) is
trivial. Thus it will suffice to prove the following: for every n > 0, if 
Q(n — 1) is true then Q(n) is true. 

To do so, we need to somehow find a way, given a graph $G$ of $n$
vertices, to reduce the task of finding a layering for $G$ into the task of
finding a layering for some other graph $G'$ of $n - 1$ vertices. The idea is
that we will find a source of $G$: a vertex $v$ that has no in-neighbors. We
can then assign to $v$ the layer 0, and layer the remaining vertices using
the inductive hypothesis in layers 1, 2, ....

The above is the intuition behind the proof of Lemma 1.23, but 
when writing the formal proof below, we use the benefit of hind- 
sight, and try to streamline what was a messy journey into a linear 
and easy-to-follow flow of logic that starts with the word “Proof:” 
and ends with “QED” or the symbol ∎.⁵ Discussions, examples and
digressions can be very insightful, but we keep them outside the space
delimited between these two words, where (as described by an excellent
handout) “every sentence must be load-bearing”. Just like we
do in programming, we can break the proof into little “subroutines” 
or “functions” (known as lemmas or claims in math language), which 
will be smaller statements that help us prove the main result. How- 
ever, the proof should be structured in a way that ensures that it is 
always crystal-clear to the reader in what stage we are of the proof. 
The reader should be able to tell what the role of every sentence is in 
the proof and which part it belongs to. We now present the formal 
proof of Lemma 1.23. 


Proof of Lemma 1.23. Let G = (V, E) be a DAG and n = |V| be the 
number of its vertices. We prove the lemma by induction on n. The 
base case is n = 0 where there are no vertices, and so the statement is 
trivially true.⁶ For the case of n > 0, we make the inductive hypothesis
that every DAG G’ of at most n — 1 vertices has a layering. 

We make the following claim: 

Claim: G must contain a vertex v of in-degree zero. 

Proof of Claim: Suppose otherwise that every vertex $v \in V$ has an
in-neighbor. Let $v_0$ be some vertex of $G$, let $v_1$ be an in-neighbor of $v_0$,
$v_2$ be an in-neighbor of $v_1$, and continue in this way for $n$ steps until
we construct a list $v_0, v_1, \ldots, v_n$ such that for every $i \in [n]$, $v_{i+1}$ is an
in-neighbor of $v_i$, or in other words the edge $v_{i+1} \to v_i$ is present in the
graph. Since there are only $n$ vertices in this graph, one of the $n + 1$
vertices in this sequence must repeat itself, and so there exists $i < j$
such that $v_i = v_j$. But then the sequence $v_j \to v_{j-1} \to \cdots \to v_i$ is a cycle
in $G$, contradicting our assumption that it is acyclic. (QED Claim)

⁵ QED stands for “quod erat demonstrandum”, which
is Latin for “what was to be demonstrated” or “the
very thing it was required to have shown”.

⁶ Using n = 0 as the base case is logically valid, but
can be confusing. If you find the trivial n = 0 case
to be confusing, you can always directly verify the
statement for n = 1 and then use both n = 0 and
n = 1 as the base cases.

Given the claim, we can let $v_0$ be some vertex of in-degree zero in
$G$, and let $G'$ be the graph obtained by removing $v_0$ from $G$. $G'$ has
$n - 1$ vertices and hence per the inductive hypothesis has a layering
$f' : (V \setminus \{v_0\}) \to \mathbb{N}$. We define $f : V \to \mathbb{N}$ as follows:

$$f(v) = \begin{cases} f'(v) + 1 & v \neq v_0 \\ 0 & v = v_0 \end{cases}$$

We claim that f is a valid layering, namely that for every edge u — 
v, f(u) < f(v). To prove this, we split into cases: 


• Case 1: $u \neq v_0$, $v \neq v_0$. In this case the edge $u \to v$ exists in the
graph $G'$ and hence by the inductive hypothesis $f'(u) < f'(v)$
which implies that $f'(u) + 1 < f'(v) + 1$, that is, $f(u) < f(v)$.


• Case 2: $u = v_0$, $v \neq v_0$. In this case $f(u) = 0$ and $f(v) = f'(v) + 1 > 0$.


• Case 3: $u \neq v_0$, $v = v_0$. This case can’t happen since $v_0$ does not
have in-neighbors.


• Case 4: $u = v_0$, $v = v_0$. This case again can’t happen since it means
that $v_0$ is its own in-neighbor: it is involved in a self loop, which is a
form of cycle that is disallowed in an acyclic graph.


Thus, f is a valid layering for G which completes the proof. 
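
The inductive proof translates almost line by line into a recursive procedure. The sketch below is one possible rendering in Python, assuming the graph is given as a set of vertices and a set of `(u, v)` edge pairs; the name `layering` and this representation are illustrative choices, and the code favors closeness to the proof over efficiency.

```python
def layering(vertices, edges):
    """Return a dict mapping each vertex to a layer, following the proof.

    Raises ValueError if no source exists, which by the Claim in the proof
    can only happen when the graph contains a cycle.
    """
    if not vertices:                      # base case: n = 0
        return {}
    has_in_neighbor = {v for (_, v) in edges}
    sources = [v for v in vertices if v not in has_in_neighbor]
    if not sources:
        raise ValueError("no vertex of in-degree zero: the graph has a cycle")
    v0 = sources[0]
    rest = vertices - {v0}                # the graph G' with v0 removed
    rest_edges = {(u, v) for (u, v) in edges if u != v0 and v != v0}
    f_prime = layering(rest, rest_edges)  # inductive hypothesis
    f = {v: f_prime[v] + 1 for v in rest}
    f[v0] = 0
    return f


edges = {("a", "b"), ("b", "c"), ("a", "c")}
f = layering({"a", "b", "c"}, edges)
print(f)
```

Note that the layering produced this way places exactly one vertex in each layer, so it is valid but generally uses more layers than necessary.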


1.6.3 Minimality and uniqueness 
Theorem 1.22 guarantees that for every DAG $G = (V, E)$ there exists
some layering $f : V \to \mathbb{N}$, but this layering is not necessarily unique.
For example, if $f : V \to \mathbb{N}$ is a valid layering of the graph then so is
the function $f'$ defined as $f'(v) = 2 \cdot f(v)$. However, it turns out that
the minimal layering is unique. A minimal layering is one where every
vertex is given the smallest layer number possible. We now formally 
define minimality and state the uniqueness theorem: 


Theorem 1.26 — Minimal layering is unique. Let $G = (V, E)$ be a DAG. We
say that a layering $f : V \to \mathbb{N}$ is minimal if for every vertex $v \in V$, if
$v$ has no in-neighbors then $f(v) = 0$ and if $v$ has in-neighbors then
there exists an in-neighbor $u$ of $v$ such that $f(u) = f(v) - 1$.

For every layering $f, g : V \to \mathbb{N}$ of $G$, if both $f$ and $g$ are minimal
then $f = g$.


The definition of minimality in Theorem 1.26 implies that for every
vertex $v \in V$, we cannot move it to a lower layer without making
the layering invalid. If $v$ is a source (i.e., has in-degree zero) then
a minimal layering $f$ must put it in layer 0, and for every other $v$, if
$f(v) = i$, then we cannot modify this to set $f(v) \leq i - 1$ since there
is an in-neighbor $u$ of $v$ satisfying $f(u) = i - 1$. What Theorem 1.26
says is that a minimal layering $f$ is unique in the sense that every other
minimal layering is equal to $f$.


Proof Idea: 

The idea is to prove the theorem by induction on the layers. If $f$ and
$g$ are minimal then they must agree on the source vertices, since both
$f$ and $g$ should assign these vertices to layer 0. We can then show that
if $f$ and $g$ agree up to layer $i - 1$, then the minimality property implies
that they need to agree in layer $i$ as well. In the actual proof we use
a small trick to save on writing. Rather than proving the statement
that $f = g$ (or in other words that $f(v) = g(v)$ for every $v \in V$),
we prove the weaker statement that $f(v) \leq g(v)$ for every $v \in V$.
(This is a weaker statement since the condition that $f(v)$ is less than or
equal to $g(v)$ is implied by the condition that $f(v)$ is equal to
$g(v)$.) However, since $f$ and $g$ are just labels we give to two minimal
layerings, by simply changing the names “f” and “g” the same proof
also shows that $g(v) \leq f(v)$ for every $v \in V$ and hence that $f = g$.


Proof of Theorem 1.26. Let $G = (V, E)$ be a DAG and $f, g : V \to \mathbb{N}$ be
two minimal valid layerings of $G$. We will prove that for every $v \in V$,
$f(v) \leq g(v)$. Since we didn’t assume anything about $f, g$ except their
minimality, the same proof will imply that for every $v \in V$, $g(v) \leq f(v)$
and hence that $f(v) = g(v)$ for every $v \in V$, which is what we needed
to show.

We will prove that $f(v) \leq g(v)$ for every $v \in V$ by induction on
$i = f(v)$. The case $i = 0$ is immediate: since in this case $f(v) = 0$,
$g(v)$ must be at least $f(v)$. For the case $i > 0$, by the minimality of $f$,
if $f(v) = i$ then there must exist some in-neighbor $u$ of $v$ such that
$f(u) = i - 1$. By the induction hypothesis we get that $g(u) \geq i - 1$, and
since $g$ is a valid layering it must hold that $g(v) > g(u)$ which means
that $g(v) \geq i = f(v)$.
∎
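
A minimal layering in the sense of Theorem 1.26 can be computed by putting every source in layer 0 and every other vertex one layer above its highest in-neighbor. The Python sketch below illustrates this under the assumption that the input is a DAG given as a vertex set and a set of `(u, v)` pairs (on a graph with a cycle the recursion would not terminate); the names `minimal_layering` and `is_minimal` are hypothetical.

```python
def minimal_layering(vertices, edges):
    """Layer 0 for sources; otherwise 1 + the largest layer of an in-neighbor."""
    in_neighbors = {v: [] for v in vertices}
    for u, v in edges:
        in_neighbors[v].append(u)
    f = {}

    def layer(v):
        if v not in f:
            preds = in_neighbors[v]
            f[v] = 0 if not preds else 1 + max(layer(u) for u in preds)
        return f[v]

    for v in vertices:
        layer(v)
    return f


def is_minimal(vertices, edges, f):
    """Check the minimality condition stated in Theorem 1.26."""
    for v in vertices:
        preds = [u for (u, w) in edges if w == v]
        if not preds and f[v] != 0:
            return False
        if preds and not any(f[u] == f[v] - 1 for u in preds):
            return False
    return True


V = {1, 2, 3, 4}
E = {(1, 2), (2, 3), (1, 3), (4, 3)}
f = minimal_layering(V, E)
doubled = {v: 2 * f[v] for v in V}  # still a layering, but not minimal
print(f, is_minimal(V, E, f), is_minimal(V, E, doubled))
```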


1.7 THIS BOOK: NOTATION AND CONVENTIONS 


Most of the notation we use in this book is standard and is used in 
most mathematical texts. The main points where we diverge are: 


• We index the natural numbers $\mathbb{N}$ starting with 0 (though many
other texts, especially in computer science, do the same).


• We also index the set $[n]$ starting with 0, and hence define it as
$\{0, \ldots, n - 1\}$. In other texts it is often defined as $\{1, \ldots, n\}$. Similarly,
we index our strings starting with 0, and hence a string $x \in \{0,1\}^n$
is written as $x_0 x_1 \cdots x_{n-1}$.


• If $n$ is a natural number then $1^n$ does not equal the number 1 but
rather this is the length $n$ string $11 \cdots 1$ (that is, a string of $n$ ones).
Similarly, $0^n$ refers to the length $n$ string $00 \cdots 0$.


• Partial functions are functions that are not necessarily defined on
all inputs. When we write $f : A \to B$ this means that $f$ is a total
function unless we say otherwise. When we want to emphasize that
$f$ can be a partial function, we will sometimes write $f : A \to_p B$.


• As we will see later on in the course, we will mostly describe our
computational problems in terms of computing a Boolean function
$f : \{0,1\}^* \to \{0,1\}$. In contrast, many other textbooks refer to the
same task as deciding a language $L \subseteq \{0,1\}^*$. These two viewpoints
are equivalent, since for every set $L \subseteq \{0,1\}^*$ there is a corresponding
function $F$ such that $F(x) = 1$ if and only if $x \in L$. Computing
partial functions corresponds to the task known in the literature as
solving a promise problem. Because the language notation is so prevalent
in other textbooks, we will occasionally remind the reader of
this correspondence.


• We use $\lceil x \rceil$ and $\lfloor x \rfloor$ for the “ceiling” and “floor” operators that
correspond to “rounding up” or “rounding down” a number to the
nearest integer. We use $(x \bmod y)$ to denote the “remainder” of $x$
when divided by $y$. That is, $(x \bmod y) = x - y \lfloor x/y \rfloor$. In context
when an integer is expected we’ll typically “silently round” the
quantities to an integer. For example, if we say that $x$ is a string of
length $\sqrt{n}$ then this means that $x$ is of length $\lceil \sqrt{n} \rceil$. (We round up
for the sake of convention, but in most such cases, it will not make a
difference whether we round up or down.)


• Like most Computer Science texts, we default to the logarithm in
base two. Thus, $\log n$ is the same as $\log_2 n$.


• We will also use the notation $f(n) = poly(n)$ as a shorthand for
$f(n) = n^{O(1)}$ (i.e., as shorthand for saying that there are some
constants $a, b$ such that $f(n) \leq a \cdot n^b$ for every sufficiently large
$n$). Similarly, we will use $f(n) = polylog(n)$ as shorthand for
$f(n) = poly(\log n)$ (i.e., as shorthand for saying that there are
some constants $a, b$ such that $f(n) \leq a \cdot (\log n)^b$ for every sufficiently
large $n$).


• As is often the case in mathematical literature, we use the apostrophe
character to enrich our set of identifiers. Typically if $x$ denotes
some object, then $x'$, $x''$, etc. will denote other objects of the same
type.


• To save on “cognitive load” we will often use round constants such
as 10, 100, 1000 in the statements of both theorems and problem
set questions. When you see such a “round” constant, you can
typically assume that it has no special significance and was just
chosen arbitrarily. For example, if you see a theorem of the form
“Algorithm A takes at most $1000 \cdot n^3$ steps to compute function F on
inputs of length n” then probably the number 1000 is an arbitrary
sufficiently large constant, and one could prove the same theorem
with a bound of the form $c \cdot n^3$ for a constant c that is smaller than
1000. Similarly, if a problem asks you to prove that some quantity is 
at least n/100, it is quite possible that in truth the quantity is at least 
n/d for some constant d that is smaller than 100. 
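
Several of the conventions in the list above have direct counterparts in code. The following Python sketch is only an illustration: the helper names `bracket`, `ones` and `mod` are hypothetical and are not notation used in this book.

```python
import math


def bracket(n: int) -> list:
    """The set [n] = {0, ..., n-1}, indexed from 0."""
    return list(range(n))


def ones(n: int) -> str:
    """The length-n string 11...1 (not the number 1)."""
    return "1" * n


def mod(x: int, y: int) -> int:
    """(x mod y) = x - y * floor(x / y), using integer floor division."""
    return x - y * (x // y)


print(bracket(4))                 # indices start at 0
print(ones(3))
print(mod(7, 3), mod(-7, 3))
print(math.ceil(math.sqrt(10)))   # "length sqrt(n)" silently rounded up
print(math.log2(8))               # logarithms default to base two
```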


1.7.1 Variable name conventions 

Like programming, mathematics is full of variables. Whenever you see 
a variable, it is always important to keep track of what its type is (e.g., 
whether the variable is a number, a string, a function, a graph, etc.). 
To make this easier, we try to stick to certain conventions and consis- 
tently use certain identifiers for variables of the same type. Some of 
these conventions are listed in Table 1.2 below. These conventions
are not immutable laws and we might occasionally deviate from them. 


Also, such conventions do not replace the need to explicitly declare for 
each new variable the type of object that it denotes. 


Table 1.2: Conventions for identifiers in this book

Identifier            Often denotes object of type
--------------------  ------------------------------------------------------------
$i, j, k, \ell, m, n$   Natural numbers (i.e., in $\mathbb{N} = \{0, 1, 2, \ldots\}$)
$\epsilon, \delta$      Small positive real numbers (very close to 0)
$x, y, z, w$            Typically strings in $\{0,1\}^*$ though sometimes numbers or
                      other objects. We often identify an object with its
                      representation as a string.
$G$                     A graph. The set of $G$’s vertices is typically denoted by $V$.
                      Often $V = [n]$. The set of $G$’s edges is typically denoted by $E$.
$S$                     Set
$f, g, h$               Functions. We often (though not always) use lowercase
                      identifiers for finite functions, which map $\{0,1\}^n$ to $\{0,1\}^m$
                      (often $m = 1$).
$F, G, H$               Infinite (unbounded input) functions mapping $\{0,1\}^*$ to
                      $\{0,1\}^*$ or $\{0,1\}^*$ to $\{0,1\}^m$ for some $m$. Based on context,
                      the identifiers $G, H$ are sometimes used to denote
                      functions and sometimes graphs.
$A, B, C$               Boolean circuits
$M, N$                  Turing machines
$P, Q$                  Programs
$T$                     A function mapping $\mathbb{N}$ to $\mathbb{N}$ that corresponds to a time
                      bound.
$c$                     A positive number (often an unspecified constant; e.g.,
                      $T(n) = O(n)$ corresponds to the existence of $c$ s.t.
                      $T(n) \leq c \cdot n$ for every $n > 0$). We sometimes use $a, b$ in a
                      similar way.
$\Sigma$                Finite set (often used as the alphabet for a set of strings).


1.7.2 Some idioms 

Mathematical texts often employ certain conventions or “idioms”. 
Some examples of such idioms that we use in this text include the 
following: 


• “Let X be ...”, “let X denote ...”, or “let X = ...”: These are all
different ways for us to say that we are defining the symbol X to
stand for whatever expression is in the .... When X is a property of
some objects we might define X by writing something along the
lines of “We say that ... has the property X if ....”. While we often
try to define terms before they are used, sometimes a mathematical
sentence reads easier if we use a term before defining it, in which
case we add “Where X is ...” to explain how X is defined in the
preceding expression.


• Quantifiers: Mathematical texts involve many quantifiers such as
“for all” and “exists”. We sometimes spell these in words as in “for
all $i \in \mathbb{N}$” or “there is $x \in \{0,1\}^*$”, and sometimes use the formal
symbols $\forall$ and $\exists$. It is important to keep track of which variable is
quantified in what way and of the dependencies between the variables. For
example, a sentence fragment such as “for every k > 0 there exists
n” means that n can be chosen in a way that depends on k. The order
of quantifiers is important. For example, the following is a true
statement: “for every natural number k > 1 there exists a prime number
n such that n divides k.” In contrast, the following statement is false:
“there exists a prime number n such that for every natural number k > 1,
n divides k.”


• Numbered equations, theorems, definitions: To keep track of all
the terms we define and statements we prove, we often assign them 
a (typically numeric) label, and then refer back to them in other 
parts of the text. 


• (i.e.,), (e.g.,): Mathematical texts tend to contain quite a few of
these expressions. We use X (i.e., Y) in cases where Y is equivalent
to X and X (e.g., Y) in cases where Y is an example of X (e.g., one
can use phrases such as “a natural number (i.e., a non-negative
integer)” or “a natural number (e.g., 7)”).


• “Thus”, “Therefore”, “We get that”: This means that the following
sentence is implied by the preceding one, as in “The n-vertex graph 
G is connected. Therefore it contains at least n — 1 edges.” We 
sometimes use “indeed” to indicate that the following text justifies 
the claim that was made in the preceding sentence as in “The n- 
vertex graph G has at least n — 1 edges. Indeed, this follows since G is 
connected.” 


• Constants: In Computer Science, we typically care about how our
algorithms’ resource consumption (such as running time) scales
with certain quantities (such as the length of the input). We refer to
quantities that do not depend on the length of the input as constants
and so often use statements such as “there exists a constant c > 0 such
that for every $n \in \mathbb{N}$, Algorithm A runs in at most $c \cdot n^2$ steps on inputs of
length n.” The qualifier “constant” for c is not strictly needed but is
added to emphasize that c here is a fixed number independent of n.
In fact sometimes, to reduce cognitive load, we will simply replace c
by a sufficiently large round number such as 10, 100, or 1000, or use
O-notation and write “Algorithm A runs in $O(n^2)$ time.”
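
The remark on the order of quantifiers in the list above can be explored on a finite range using Python's `all` and `any`, which play the roles of “for all” and “exists”. This is only a sketch: the bound 200 and the helper `is_prime` are arbitrary illustrative choices, and restricting to a finite range is an assumption that does not by itself establish the statements over all natural numbers.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small n."""
    return n >= 2 and all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))


ks = range(2, 200)                                  # the numbers k > 1
primes = [n for n in range(2, 200) if is_prime(n)]

# "For every k > 1 there exists a prime n such that n divides k."
forall_exists = all(any(k % n == 0 for n in primes) for k in ks)

# "There exists a prime n such that for every k > 1, n divides k."
exists_forall = any(all(k % n == 0 for k in ks) for n in primes)

print(forall_exists, exists_forall)
```

In the first statement the prime may depend on k, while in the second a single prime would have to work for every k at once.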


1.8 EXERCISES 


Exercise 1.1 — Logical expressions. a. Write a logical expression $\varphi(x)$
involving the variables $x_0, x_1, x_2$ and the operators $\wedge$ (AND), $\vee$
(OR), and $\neg$ (NOT), such that $\varphi(x)$ is true if the majority of the
inputs are True.


b. Write a logical expression $\varphi(x)$ involving the variables $x_0, x_1, x_2$
and the operators $\wedge$ (AND), $\vee$ (OR), and $\neg$ (NOT), such that $\varphi(x)$
is true if the sum $\sum_{i=0}^{2} x_i$ (identifying “true” with 1 and “false”
with 0) is odd.


Exercise 1.2 — Quantifiers. Use the logical quantifiers $\forall$ (for all), $\exists$ (there
exists), as well as $\wedge$, $\vee$, $\neg$ and the arithmetic operations $+, \times, =, >, <$ to
write the following:


a. An expression $\varphi(n, k)$ such that for every natural number $n, k$,
$\varphi(n, k)$ is true if and only if $k$ divides $n$.


b. An expression $\varphi(n)$ such that for every natural number $n$, $\varphi(n)$ is
true if and only if $n$ is a power of three.

Exercise 1.3 Describe the following statement in English words:
$\forall_{n \in \mathbb{N}} \exists_{p > n} \forall a, b \in \mathbb{N} \; (a \times b \neq p) \vee (a = 1)$.


Exercise 1.4 — Set construction notation. Describe in words the following 
sets: 


a. $S = \{ x \in \{0,1\}^{100} : \forall_{i \in \{0, \ldots, 99\}} \; x_i = x_{99-i} \}$


yesi 


Exercise 1.5 — Existence of one to one mappings. For each one of the following
pairs of sets $(S, T)$, prove or disprove the following statement:
there is a one to one function $f$ mapping $S$ to $T$.

a. Let $n > 10$. $S = \{0,1\}^n$ and $T = [n] \times [n] \times [n]$.

b. Let $n > 10$. $S$ is the set of all functions mapping $\{0,1\}^n$ to $\{0,1\}$.
$T = \{0,1\}$.

c. Let $n > 100$. $S = \{k \in [n] \mid k \text{ is prime}\}$, $T = \{0,1\}^{\lceil \log n - 1 \rceil}$.


Exercise 1.6 — Inclusion Exclusion. a. Let $A, B$ be finite sets. Prove that
$|A \cup B| = |A| + |B| - |A \cap B|$.

b. Let $A_0, \ldots, A_{k-1}$ be finite sets. Prove that
$|A_0 \cup \cdots \cup A_{k-1}| \geq \sum_{i=0}^{k-1} |A_i| - \sum_{0 \leq i < j < k} |A_i \cap A_j|$.

c. Let $A_0, \ldots, A_{k-1}$ be finite subsets of $\{1, \ldots, n\}$, such that $|A_i| = m$ for
every $i \in [k]$. Prove that if $k > 100n$, then there exist two distinct
sets $A_i, A_j$ s.t. $|A_i \cap A_j| \geq m^2/(10n)$.

Exercise 1.7 Prove that if $S, T$ are finite and $F : S \to T$ is one to one
then $|S| \leq |T|$.

Exercise 1.8 Prove that if $S, T$ are finite and $F : S \to T$ is onto then
$|S| \geq |T|$.

Exercise 1.9 Prove that for every finite $S, T$, there are $(|T| + 1)^{|S|}$ partial
functions from $S$ to $T$.


Exercise 1.10 Suppose that $\{S_n\}_{n \in \mathbb{N}}$ is a sequence such that $S_0 \leq 10$ and
for $n > 1$, $S_n \leq 5 S_{\lfloor n/5 \rfloor} + 2n$. Prove by induction that $S_n \leq 100 n \log n$ for
every $n$.
Exercise 1.11 Prove that for every undirected graph G of 100 vertices,
if every vertex has degree at most 4, then there exists a subset S of at 
least 20 vertices such that no two vertices in S are neighbors of one 
another. 



Exercise 1.12 — O-notation. For every pair of functions $F, G$ below, determine
which of the following relations holds: $F = O(G)$, $F = \Omega(G)$,
$F = o(G)$ or $F = \omega(G)$.


a. $F(n) = n$, $G(n) = 100n$.

b. $F(n) = n$, $G(n) = \sqrt{n}$.

c. $F(n) = n \log n$, $G(n) = 2^{(\log(n))^2}$.

d. $F(n) = \sqrt{n}$, $G(n) = 2^{\sqrt{\log n}}$.

e. $F(n) = \binom{n}{\lceil 0.2n \rceil}$, $G(n) = 2^{0.1n}$ (where $\binom{n}{k}$ is the number of $k$-sized
subsets of a set of size $n$). See footnote for hint.⁷

⁷ One way to do this is to use Stirling’s approximation
for the factorial function.


Exercise 1.13 Give an example of a pair of functions $F, G : \mathbb{N} \to \mathbb{N}$ such
that neither $F = O(G)$ nor $G = O(F)$ holds.


Exercise 1.14 Prove that for every undirected graph G on n vertices, if G 
has at least n edges then G contains a cycle. 


Exercise 1.15 Prove that for every undirected graph G of 1000 vertices, 
if every vertex has degree at most 4, then there exists a subset S of at 
least 200 vertices such that no two vertices in S are neighbors of one 
another. 


1.9 BIBLIOGRAPHICAL NOTES 


The heading “A Mathematician’s Apology”, refers to Hardy’s classic 
book [Har41]. Even when Hardy is wrong, he is very much worth 
reading. 

There are many online sources for the mathematical background 
needed for this book. In particular, the lecture notes for MIT 6.042 
“Mathematics for Computer Science” [LLM18] are extremely com- 
prehensive, and videos and assignments for this course are available 
online. Similarly, Berkeley CS 70: “Discrete Mathematics and Proba- 
bility Theory” has extensive lecture notes online. 

Other sources for discrete mathematics are Rosen [Ros19] and
Jim Aspnes’ online book [Asp18]. Lewis and Zax [LZ19], as well
as the online book of Fleck [Fle18], give a more gentle overview of 
much of the same material. Solow [Sol14] is a good introduction 
to proof reading and writing. Kun [Kun18] gives an introduction 
to mathematics aimed at readers with programming backgrounds. 
Stanford’s CS 103 course has a wonderful collection of handouts on 
mathematical proof techniques and discrete mathematics. 

The word graph in the sense of Definition 1.3 was coined by the 
mathematician Sylvester in 1878 in analogy with the chemical graphs 
used to visualize molecules. There is an unfortunate confusion be- 
tween this term and the more common usage of the word “graph” as 
a way to plot data, and in particular a plot of some function f(x) as a 
function of x. One way to relate these two notions is to identify every 
function f : A → B with the directed graph G_f over the vertex set
V = A ∪ B such that G_f contains the edge x → f(x) for every x ∈ A. In
a graph G_f constructed in this way, every vertex in A has out-degree
equal to one. If the function f is one to one then every vertex in B has 
in-degree at most one. If the function f is onto then every vertex in B 
has in-degree at least one. If f is a bijection then every vertex in B has 
in-degree exactly equal to one. 

Carl Pomerance’s quote is taken from the home page of Doron 
Zeilberger. 


2 Computation and Representation


Learning Objectives:

• Distinguish between specification and implementation, or equivalently
  between mathematical functions and algorithms/programs.

• Representing an object as a string (often of zeroes and ones).

• Examples of representations for common objects such as numbers,
  vectors, lists, and graphs.

• Prefix-free representations.

• Cantor’s Theorem: The real numbers cannot be represented exactly as
  finite strings.


“The alphabet (sic) was a great invention, which enabled men (sic) to store 
and to learn with little effort what others had learned the hard way — that is, to 
learn from books rather than from direct, possibly painful, contact with the real 
world.”, B.F. Skinner 


“The name of the song is called ‘HADDOCK’S EYES.” [said the Knight] 
“Oh, that’s the name of the song, is it?” Alice said, trying to feel interested. 


“No, you don’t understand,” the Knight said, looking a little vexed. “That’s 
what the name is CALLED. The name really is ‘THE AGED AGED MAN.” 


“Then I ought to have said ‘That's what the SONG is called’?” Alice cor- 
rected herself. 


“No, you oughtn’t: that’s quite another thing! The SONG is called ‘WAYS 
AND MEANS’: but that’s only what it’s CALLED, you know!” 


“Well, what IS the song, then?” said Alice, who was by this time com- 
pletely bewildered. 


“I was coming to that,” the Knight said. “The song really IS ‘A-SITTING ON 
A GATE’: and the tune’s my own invention.” 


Lewis Carroll, Through the Looking-Glass 


To a first approximation, computation is a process that maps an input 
to an output. 

When discussing computation, it is essential to separate the question
of what is the task we need to perform (i.e., the specification) from
the question of how we achieve this task (i.e., the implementation).
For example, as we've seen, there is more than one way to achieve the
computational task of computing the product of two integers.

Figure 2.1: Computation as a process that maps an input to an output.

In this chapter we focus on the what part, namely defining computational
tasks. For starters, we need to define the inputs and outputs.
Capturing all the potential inputs and outputs that we might
ever want to compute seems challenging, since computation today is
applied to a wide variety of objects. We do not compute merely on
numbers, but also on texts, images, videos, connection graphs of social
networks, MRI scans, gene data, and even other programs. We will
represent all these objects as strings of zeroes and ones, that is objects 
such as 0011101 or 1011 or any other finite list of 1’s and 0’s. (This 
choice is for convenience: there is nothing “holy” about zeroes and 
ones, and we could have used any other finite collection of symbols.) 

Today, we are so used to the notion of digital representation that 
we are not surprised by the existence of such an encoding. But it is 
actually a deep insight with significant implications. Many animals 
can convey a particular fear or desire, but what is unique about hu- 
mans is language: we use a finite collection of basic symbols to describe 
a potentially unlimited range of experiences. Language allows trans- 
mission of information over both time and space and enables soci- 
eties that span a great many people and accumulate a body of shared 
knowledge over time. 

Over the last several decades, we have seen a revolution in what we 
can represent and convey in digital form. We can capture experiences
with almost perfect fidelity, and disseminate them essentially
instantaneously to an unlimited audience. Moreover, once information is in
digital form, we can compute over it, and gain insights from data that 
were not accessible in prior times. At the heart of this revolution is the 
simple but profound observation that we can represent an unbounded 
variety of objects using a finite set of symbols (and in fact using only 
the two symbols 0 and 1).

In later chapters, we will typically take such representations for 
granted, and hence use expressions such as “program P takes x as 
input” when x might be a number, a vector, a graph, or any other
object, when we really mean that P takes as input the representation of 
x as a binary string. However, in this chapter we will dwell a bit more 
on how we can construct such representations. 


Figure 2.2: We represent numbers, texts, images, net- 
works and many other objects using strings of zeroes 
and ones. Writing the zeroes and ones themselves in 
green font over a black background is optional. 


2.1 DEFINING REPRESENTATIONS 


Every time we store numbers, images, sounds, databases, or other ob- 
jects on a computer, what we actually store in the computer’s memory 
is the representation of these objects. Moreover, the idea of representa- 
tion is not restricted to digital computers. When we write down text or 
make a drawing we are representing ideas or experiences as sequences 
of symbols (which might as well be strings of zeroes and ones). Even 
our brain does not store the actual sensory inputs we experience, but 
rather only a representation of them. 

To use objects such as numbers, images, graphs, or others as inputs 
for computation, we need to define precisely how to represent these 
objects as binary strings. A representation scheme is a way to map an
object x to a binary string E(x) ∈ {0,1}*. For example, a representation
scheme for natural numbers is a function E : N → {0,1}*. Of course,
we cannot merely represent all numbers as the string “0011” (for ex- 
ample). A minimal requirement is that if two numbers x and x’ are 
different then they would be represented by different strings. Another 
way to say this is that we require the encoding function E to be one to 
one. 


2.1.1 Representing natural numbers

We now show how we can represent natural numbers as binary 
strings. Over the years people have represented numbers in a variety 
of ways, including Roman numerals, tally marks, our own Hindu- 
Arabic decimal system, and many others. We can use any one of 
those as well as many others to represent a number as a string (see 
Fig. 2.3). However, for the sake of concreteness, we use the binary 
basis as our default representation of natural numbers as strings. 

For example, we represent the number six as the string 110 since
1·2^2 + 1·2^1 + 0·2^0 = 6, and similarly we represent the number thirty-five
as the string y = 100011 which satisfies Σ_{i=0}^{5} y_i · 2^(|y|−i−1) = 35.
Some more examples are given in the table below.


Table 2.1: Representing numbers in the binary basis. The left- 
hand column contains representations of natural numbers in the 
decimal basis, while the right-hand column contains representa- 
tions of the same numbers in the binary basis. 


Number (decimal representation)    Number (binary representation)
0                                  0
1                                  1
2                                  10
5                                  101
16                                 10000
40                                 101000
53                                 110101
389                                110000101
3750                               111010100110


If n is even, then the least significant digit of n’s binary representation
is 0, while if n is odd then this digit equals 1. Just like the number
⌊n/10⌋ corresponds to “chopping off” the least significant decimal
digit (e.g., ⌊457/10⌋ = ⌊45.7⌋ = 45), the number ⌊n/2⌋ corresponds
to “chopping off” the least significant binary digit. Hence the binary
representation can be formally defined as the following function
NtS : N → {0,1}* (NtS stands for “natural numbers to strings”):

             0                       if n = 0
NtS(n) =     1                       if n = 1          (2.1)
             NtS(⌊n/2⌋) parity(n)    if n > 1

where parity : N → {0,1} is the function defined as parity(n) = 0
if n is even and parity(n) = 1 if n is odd, and as usual, for strings
x, y ∈ {0,1}*, xy denotes the concatenation of x and y. The function
NtS is defined recursively: for every n > 1 we define NtS(n) in terms
of the representation of the smaller number ⌊n/2⌋. It is also possible to
define NtS non-recursively, see Exercise 2.2.

Figure 2.3: Representing each one of the digits 0, 1, 2, ..., 9 as a
12 × 8 bitmap image, which can be thought of as a string in {0,1}^96.
Using this scheme we can represent a natural number x of n decimal
digits as a string in {0,1}^(96n). Image taken from blog post of
A. C. Andersen.

Throughout most of this book, the particular choices of represen- 
tation of numbers as binary strings would not matter much: we just 
need to know that such a representation exists. In fact, for many of our 
purposes we can even use the simpler representation of mapping a 
natural number n to the length-n all-zero string 0^n.


Remark 2.1 — Binary representation in python (optional).
We can implement the binary representation in Python
as follows:

    def NtS(n):  # natural numbers to strings
        if n > 1:
            # recurse on floor(n/2), then append the parity bit of n
            return NtS(n // 2) + str(n % 2)
        else:
            return str(n % 2)

    print(NtS(236))
    # 11101100

    print(NtS(19))
    # 10011

We can also use Python to implement the inverse
transformation, mapping a string back to the natural
number it represents.

    def StN(x):  # string to number
        k = len(x) - 1
        # the digit x[i] carries weight 2^(k-i)
        return sum(int(x[i]) * (2 ** (k - i)) for i in range(k + 1))

    print(StN(NtS(236)))
    # 236


Remark 2.2 — Programming examples. In this book, 
we sometimes use code examples as in Remark 2.1. 
The point is always to emphasize that certain com- 
putations can be achieved concretely, rather than 
illustrating the features of Python or any other pro- 
gramming language. Indeed, one of the messages of 
this book is that all programming languages are in 
a certain precise sense equivalent to one another, and 
hence we could have just as well used JavaScript, C, 
COBOL, Visual Basic or even BrainF*ck. This book 
is not about programming, and it is absolutely OK if 


2.1.2 Meaning of representations (discussion)

It is natural for us to think of 236 as the “actual” number, and of 
11101100 as “merely” its representation. However, for most Euro- 
peans in the middle ages CCXXXVI would be the “actual” number and 
236 (if they have even heard about it) would be the weird Hindu-Arabic
positional representation.¹ When our AI robot overlords materialize,
they will probably think of 11101100 as the “actual” number
and of 236 as “merely” a representation that they need to use when 
they give commands to humans. 

So what is the “actual” number? This is a question that philoso- 
phers of mathematics have pondered throughout history. Plato ar- 
gued that mathematical objects exist in some ideal sphere of existence 
(that to a certain extent is more “real” than the world we perceive 
via our senses, as this latter world is merely the shadow of this ideal 
sphere). In Plato’s vision, the symbols 236 are merely notation for 
some ideal object, that, in homage to the late musician, we can refer to 
as “the number commonly represented by 236”. 

The Austrian philosopher Ludwig Wittgenstein, on the other hand, 
argued that mathematical objects do not exist at all, and the only 
things that exist are the actual marks on paper that make up 236, 
11101100 or CCXXXVI. In Wittgenstein’s view, mathematics is merely 
about formal manipulation of symbols that do not have any inherent 
meaning. You can think of the “actual” number as (somewhat recur- 
sively) “that thing which is common to 236, 11101100 and CCXXXVI 
and all other past and future representations that are meant to capture 
the same object”. 

While reading this book, you are free to choose your own phi- 
losophy of mathematics, as long as you maintain the distinction be- 
tween the mathematical objects themselves and the various particular 
choices of representing them, whether as splotches of ink, pixels on a 
screen, zeroes and ones, or any other form. 


2.2 REPRESENTATIONS BEYOND NATURAL NUMBERS 


We have seen that natural numbers can be represented as binary
strings. We now show that the same is true for other types of objects,
including (potentially negative) integers, rational numbers, vectors,
lists, graphs and many others. In many instances, choosing the “right”
string representation for a piece of data is highly non-trivial, and
finding the “best” one (e.g., most compact, best fidelity, most efficiently
manipulable, robust to errors, most informative features, etc.) is the
object of intense research. But for now, we focus on presenting some
simple representations for various objects that we would like to use as
inputs and outputs for computation.


¹ While the Babylonians already invented a positional
system much earlier, the decimal positional system
we use today was invented by Indian mathematicians
around the third century. Arab mathematicians took
it up in the 8th century. It first received significant
attention in Europe with the publication of the 1202
book “Liber Abaci” by Leonardo of Pisa, also known as
Fibonacci, but it did not displace Roman numerals in
common usage until the 15th century.


2.2.1 Representing (potentially negative) integers 

Since we can represent natural numbers as strings, we can
represent the full set of integers (i.e., members of the set
Z = {..., −3, −2, −1, 0, +1, +2, +3, ...}) by adding one more bit
that represents the sign. To represent a (potentially negative) number
m, we prepend to the representation of the natural number |m| a bit σ
that equals 0 if m ≥ 0 and equals 1 if m < 0. Formally, we define the
function ZtS : Z → {0,1}* as follows

ZtS(m) =     0 NtS(m)      if m ≥ 0
             1 NtS(−m)     if m < 0

where NtS is defined as in (2.1).

While the encoding function of a representation needs to be one 
to one, it does not have to be onto. For example, in the representation 
above there is no number that is represented by the empty string 
but it is still a fine representation, since every integer is represented 
uniquely by some string. 
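
As a small illustration of the sign-bit representation above, here is a minimal Python sketch. It reuses the function NtS of Remark 2.1; the decoder name StZ is our own choice for this example and is not notation used elsewhere in the book.

```python
def NtS(n):
    """Binary representation of a natural number, following (2.1)."""
    if n > 1:
        return NtS(n // 2) + str(n % 2)
    return str(n % 2)

def ZtS(m):
    """Sign bit (0 for m >= 0, 1 for m < 0) followed by NtS(|m|)."""
    sign = "0" if m >= 0 else "1"
    return sign + NtS(abs(m))

def StZ(s):
    """Hypothetical decoder for ZtS: read the sign bit, then the magnitude."""
    magnitude = int(s[1:], 2)
    return -magnitude if s[0] == "1" else magnitude

print(ZtS(5))    # 0101
print(ZtS(-5))   # 1101
print(ZtS(8))    # 01000
```

Note that StZ also accepts strings such as 10 that are not the representation of any integer under ZtS; this reflects the fact that the encoding is one to one but not onto.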




2.2.2 Two’s complement representation (optional) 
Section 2.2.1’s approach of representing an integer using a specific
“sign bit” is known as the Signed Magnitude Representation and was
used in some early computers. However, the two’s complement
representation is much more common in practice. The two’s complement
representation of an integer k in the set {−2^n, −2^n + 1, ..., 2^n − 1} is the
string ZtS_n(k) of length n + 1 defined as follows:

ZtS_n(k) =     NtS_{n+1}(k)               if 0 ≤ k ≤ 2^n − 1
               NtS_{n+1}(2^(n+1) + k)     if −2^n ≤ k ≤ −1

where NtS_ℓ(m) denotes the standard binary representation of a number
m ∈ {0, ..., 2^ℓ − 1} as a string of length ℓ, padded with leading zeros
as needed. For example, if n = 3 then ZtS_3(1) = NtS_4(1) = 0001,
ZtS_3(2) = NtS_4(2) = 0010, ZtS_3(−1) = NtS_4(16 − 1) = 1111, and
ZtS_3(−8) = NtS_4(16 − 8) = 1000. If k is a negative number larger than
or equal to −2^n then 2^(n+1) + k is a number between 2^n and 2^(n+1) − 1.
Hence the two’s complement representation of such a number k is a
string of length n + 1 with its first digit equal to 1.

Another way to say this is that we represent a potentially negative
number k ∈ {−2^n, ..., 2^n − 1} as the non-negative number k mod 2^(n+1)
(see also Fig. 2.4). This means that if two (potentially negative)
numbers k and k’ are not too large (i.e., k + k’ ∈ {−2^n, ..., 2^n − 1}), then
we can compute the representation of k + k’ by adding modulo 2^(n+1) the
representations of k and k’ as if they were non-negative integers. This
property of the two’s complement representation is its main attraction
since, depending on their architectures, microprocessors can often
perform arithmetic operations modulo 2^w very efficiently (for certain
values of w such as 32 and 64). Many systems leave it to the programmer
to check that values are not too large and will carry out this
modular arithmetic regardless of the size of the numbers involved. For
this reason, in some systems adding two large positive numbers can
result in a negative number (e.g., adding 2^n − 100 and 2^n − 200 might
result in −300 since (2^(n+1) − 300) mod 2^(n+1) = −300, see also Fig. 2.4).
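
The following Python sketch illustrates the two’s complement representation and the modular addition described above. The function names NtS_fixed, ZtS_n, StZ_n and add_mod are our own choices for this example, and Python’s built-in format is used in place of the recursive NtS purely for brevity.

```python
def NtS_fixed(m, length):
    """Binary representation of m in {0,...,2**length - 1}, zero-padded to `length` bits."""
    assert 0 <= m < 2 ** length
    return format(m, "b").zfill(length)

def ZtS_n(k, n):
    """Two's complement representation of k in {-2**n,...,2**n - 1} using n+1 bits."""
    assert -(2 ** n) <= k <= 2 ** n - 1
    return NtS_fixed(k % 2 ** (n + 1), n + 1)

def StZ_n(s):
    """Decode an (n+1)-bit two's complement string back to an integer."""
    value = int(s, 2)
    return value - 2 ** len(s) if s[0] == "1" else value

def add_mod(s, t):
    """Add two representations as non-negative integers modulo 2**(n+1)."""
    length = len(s)
    return NtS_fixed((int(s, 2) + int(t, 2)) % 2 ** length, length)

n = 3
print(ZtS_n(-1, n))                                 # 1111
print(ZtS_n(-8, n))                                 # 1000
print(StZ_n(add_mod(ZtS_n(3, n), ZtS_n(-5, n))))    # -2
print(StZ_n(add_mod(ZtS_n(6, n), ZtS_n(5, n))))     # -5, since 6 + 5 overflows
```

The last line reproduces the overflow of Fig. 2.4: the sum 11 is outside the range {−8, ..., 7}, and the string 1011 is read back as −5.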


2.2.3 Rational numbers and representing pairs of strings 
We can represent a rational number of the form a/b by represent- 
ing the two numbers a and b. However, merely concatenating the 
representations of a and b will not work. For example, the binary rep- 
resentation of 4 is 100 and the binary representation of 43 is 101011, 
but the concatenation 100101011 of these strings is also the concatena- 
tion of the representation 10010 of 18 and the representation 1011 of 
11. Hence, if we used such simple concatenation then we would not 
be able to tell if the string 100101011 is supposed to represent 4/43 or 
18/11. 

We tackle this by giving a general representation for pairs of strings.
If we were using a pen and paper, we would just use a separator symbol
such as ‖ to represent, for example, the pair consisting of the numbers
represented by 10 and 110001 as the length-9 string “10‖110001”.
In other words, there is a one to one map F from pairs of strings
x, y ∈ {0,1}* into a single string z over the alphabet Σ = {0, 1, ‖}
(in other words, z ∈ Σ*). Using such separators is similar to the
way we use spaces and punctuation to separate words in English. By
adding a little redundancy, we achieve the same effect in the digital
domain. We can map the three-element set Σ to the three-element set
{00, 11, 01} ⊂ {0,1}^2 in a one-to-one fashion, and hence encode a
length n string z ∈ Σ* as a length 2n string w ∈ {0,1}*.

Figure 2.4: In the two’s complement representation
we represent a potentially negative integer k ∈
{−2^n, ..., 2^n − 1} as an n + 1 length string using the
binary representation of the integer k mod 2^(n+1). On
the left-hand side: this representation for n = 3 (the
red integers are the numbers being represented by
the blue binary strings). If a microprocessor does not
check for overflows, adding the two positive numbers
6 and 5 might result in the negative number −5 (since
−5 mod 16 = 11). The right-hand side is a C program
that will on some 32 bit architecture print a negative
number after adding two positive numbers. (Integer
overflow in C is considered undefined behavior which
means the result of this program, including whether
it runs or crashes, could differ depending on the
architecture, compiler, and even compiler options and
version.)

    #include <stdio.h>

    int main(void) {
        int a = 2147483548;
        int b = 2147483448;
        printf("%d", a + b);
        return 0;
    }

Our final representation for rational numbers is obtained by com- 
posing the following steps: 


1. Representing a non-negative rational number as a pair of natural 
numbers. 


2. Representing a natural number by a string via the binary represen- 
tation. 


3. Combining 1 and 2 to obtain a representation of a rational number 
as a pair of strings. 


4. Representing a pair of strings over {0, 1} as a single string over
Σ = {0, 1, ‖}.


5. Representing a string over Σ as a longer string over {0, 1}.


Example 2.4 — Representing a rational number as a string. Consider the
rational number r = −5/8. We represent −5 as 1101 and +8 as
01000, and so we can represent r as the pair of strings (1101, 01000)
and represent this pair as the length 10 string 1101‖01000 over
the alphabet {0, 1, ‖}. Now, applying the map 0 ↦ 00, 1 ↦ 11,
‖ ↦ 01, we can represent the latter string as the length 20 string
s = 11110011010011000000 over the alphabet {0, 1}.
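
The steps of Example 2.4 can be carried out mechanically. Below is a minimal Python sketch, under the assumption that numerator and denominator are each encoded with the sign-bit representation ZtS of Section 2.2.1; the names pair_to_string, string_to_pair and QtS are ours, and the ASCII character | stands in for the separator symbol ‖.

```python
def NtS(n):
    """Binary representation of a natural number, following (2.1)."""
    if n > 1:
        return NtS(n // 2) + str(n % 2)
    return str(n % 2)

def ZtS(m):
    """Sign bit followed by the binary representation of |m|."""
    return ("0" if m >= 0 else "1") + NtS(abs(m))

SEP = "|"                                   # stand-in for the separator symbol
CODE = {"0": "00", "1": "11", SEP: "01"}    # one-to-one map from the alphabet to pairs of bits
DECODE = {v: k for k, v in CODE.items()}

def pair_to_string(x, y):
    """Encode a pair of binary strings as one binary string of length 2(|x|+|y|+1)."""
    return "".join(CODE[c] for c in x + SEP + y)

def string_to_pair(w):
    """Invert pair_to_string by decoding two bits at a time and splitting at the separator."""
    z = "".join(DECODE[w[i:i + 2]] for i in range(0, len(w), 2))
    x, y = z.split(SEP)
    return x, y

def QtS(a, b):
    """Represent the rational number a/b as a single binary string."""
    return pair_to_string(ZtS(a), ZtS(b))

print(QtS(-5, 8))                   # 11110011010011000000
print(string_to_pair(QtS(-5, 8)))   # ('1101', '01000')

# Plain concatenation, by contrast, is ambiguous:
print(NtS(4) + NtS(43) == NtS(18) + NtS(11))   # True
```

Decoding works because each two-bit block determines one symbol, so the position of the separator is never in doubt.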


The same idea can be used to represent triples of strings, quadruples,
and so on as a string. Indeed, this is one instance of a very general
principle that we use time and again in both the theory and practice
of computer science (for example, in Object Oriented programming):


Repeating the same idea, once we can represent objects of type T, 
we can also represent lists of lists of such objects, and even lists of lists 
of lists and so on and so forth. We will come back to this point when 
we discuss prefix free encoding in Section 2.5.2. 


2.3 REPRESENTING REAL NUMBERS


The set of real numbers R contains all numbers including positive,
negative, and fractional, as well as irrational numbers such as π or e.
Every real number can be approximated by a rational number, and
thus we can represent every real number x by a rational number a/b
that is very close to x. For example, we can represent π by 22/7 within
an error of about 10^(−3). If we want a smaller error (e.g., about 10^(−4))
then we can use 311/99, and so on and so forth.
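
The two error estimates above are easy to check numerically. The sketch below uses Python’s standard fractions module; the helper name approximation_error is ours, and note that math.pi is itself only a floating-point approximation of π, which is accurate enough for this comparison.

```python
import math
from fractions import Fraction

def approximation_error(q):
    """Distance between the rational q and Python's floating-point value of pi."""
    return abs(float(q) - math.pi)

for q in (Fraction(22, 7), Fraction(311, 99)):
    # The errors come out on the order of 1e-3 and 1e-4 respectively.
    print(q, approximation_error(q))

# The closest fraction to math.pi with denominator at most 1000
best = Fraction(math.pi).limit_denominator(1000)
print(best)  # 355/113
```

Allowing larger denominators gives better approximations, which is the sense in which every real number can be represented by rationals to any desired accuracy.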


[Figure 2.5 diagram: a bit string divided into a sign bit, an exponent
field, and a fraction (mantissa) field.]


The above representation of real numbers via rational numbers that
approximate them is a fine choice for a representation scheme. However,
typically in computing applications, it is more common to use
the floating-point representation scheme (see Fig. 2.5) to represent real
numbers. In the floating-point representation scheme we represent
x ∈ R by the pair (b, e) of (positive or negative) integers of some
prescribed sizes (determined by the desired accuracy) such that b × 2^e
is closest to x. Floating-point representation is the base-two version
of scientific notation, where one represents a number y ∈ R as its
approximation of the form b × 10^e for b, e. It is called “floating-point”
because we can think of the number b as specifying a sequence of
binary digits, and e as describing the location of the “binary point”
within this sequence. The use of floating-point representation is the reason
why in many programming systems, printing the expression 0.1+0.2
will result in 0.30000000000000004 and not 0.3.
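
The behavior described above can be observed directly in Python. The sketch below assumes that Python floats are IEEE 754 double-precision numbers, which is the case on common platforms; the helper name decompose is ours. It prints the sum 0.1+0.2, shows the exact rational number that is stored for the literal 0.1, and splits a float into a sign, a fraction between 1 and 2, and an exponent, as in Fig. 2.5.

```python
import math
from fractions import Fraction

print(0.1 + 0.2)          # 0.30000000000000004
print(0.1 + 0.2 == 0.3)   # False

# The literal 0.1 is stored as the nearest number of the form b * 2**e,
# which is close to, but not equal to, 1/10.
print(Fraction(0.1))      # 3602879701896397/36028797018963968

def decompose(x):
    """Return (sigma, b, e) with x == sigma * b * 2**e and 1 <= b < 2, for nonzero x."""
    sigma = -1 if x < 0 else 1
    m, exp = math.frexp(abs(x))   # abs(x) == m * 2**exp with 0.5 <= m < 1
    return sigma, m * 2, exp - 1

print(decompose(-56.5625))  # (-1, 1.767578125, 5)
```

Since 1/10 has no finite binary expansion, the stored values of 0.1 and 0.2 are both slightly off, and their sum is rounded to a float that differs from the one closest to 0.3.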

The reader might be (rightly) worried about the fact that the 
floating-point representation (or the rational number one) can only 
approximately represent real numbers. In many (though not all) com- 
putational applications, one can make the accuracy tight enough so 
that this does not affect the final result, though sometimes we do need 
to be careful. Indeed, floating-point bugs can sometimes be no jok- 
ing matter. For example, floating-point rounding errors have been 
implicated in the failure of a U.S. Patriot missile to intercept an Iraqi 
Scud missile, costing 28 lives, as well as a 100 million pound error in 
computing payouts to British pensioners. 


Figure 2.5: The floating-point representation of a real
number x ∈ R is its approximation as a number of
the form σb · 2^e where σ ∈ {±1}, e is a (potentially
negative) integer, and b is a rational number between
1 and 2 expressed as a binary fraction 1.b_1 b_2 ... b_k
for some b_1, ..., b_k ∈ {0,1} (that is, b = 1 + b_1/2 +
b_2/4 + ... + b_k/2^k). Commonly-used floating-point
representations fix the numbers ℓ and k of bits to
represent e and b respectively. In the example above,
assuming we use two’s complement representation
for e, the number represented is −1 × 2^5 × (1 + 1/2 +
1/4 + 1/64 + 1/512) = −56.5625.


Figure 2.6: XKCD cartoon on floating-point arithmetic.


2.4 CANTOR’S THEOREM, COUNTABLE SETS, AND STRING REPRESENTATIONS OF THE REAL NUMBERS

“For any collection of fruits, we can make more fruit salads than there are
fruits. If not, we could label each salad with a different fruit, and consider the
salad of all fruits not in their salad. The label of this salad is in it if and only if
it is not.”, Martha Storey.


Given the issues with floating-point approximations for real num- 
bers, a natural question is whether it is possible to represent real num- 
bers exactly as strings. Unfortunately, the following theorem shows 
that this cannot be done: 


Theorem 2.5 — Cantor’s Theorem. There does not exist a one-to-one
function RtS : R → {0,1}*.²


² RtS stands for “real numbers to strings”.


Countable sets. We say that a set S is countable if there is an onto
map C : N → S, or in other words, we can write S as the sequence
C(0), C(1), C(2), .... Since the binary representation yields an onto
map from {0,1}* to N, and the composition of two onto maps is onto,
a set S is countable iff there is an onto map from {0,1}* to S. Using
the basic properties of functions (see Section 1.4.3), a set is countable
if and only if there is a one-to-one function from S to {0,1}*. Hence,
we can rephrase Theorem 2.5 as follows:


Theorem 2.6 — Cantor’s Theorem (equivalent statement). The reals are
uncountable. That is, there does not exist an onto function NtR : N → R.


Theorem 2.6 was proven by Georg Cantor in 1874. This result (and 
the theory around it) was quite shocking to mathematicians at the 
time. By showing that there is no one-to-one map from R to {0,1}* (or 
N), Cantor showed that these two infinite sets have “different forms of 
infinity” and that the set of real numbers R is in some sense “bigger” 
than the infinite set {0,1}*. The notion that there are “shades of infin- 
ity” was deeply disturbing to mathematicians and philosophers at the 
time. The philosopher Ludwig Wittgenstein (whom we mentioned be- 
fore) called Cantor’s results “utter nonsense” and “laughable.” Others 
thought they were even worse than that. Leopold Kronecker called 
Cantor a “corrupter of youth,” while Henri Poincaré said that Can- 
tor’s ideas “should be banished from mathematics once and for all.” 
The tide eventually turned, and these days Cantor’s work is univer- 
sally accepted as the cornerstone of set theory and the foundations of 
mathematics. As David Hilbert said in 1925, “No one shall expel us from 
the paradise which Cantor has created for us.” As we will see later in this
book, Cantor’s ideas also play a huge role in the theory of computation.

Now that we have discussed Theorem 2.5’s importance, let us see 
the proof. It is achieved in two steps: 


1. Define some infinite set X for which it is easier for us to prove that
X is not countable (namely, it’s easier for us to prove that there is
no one-to-one function from X to {0,1}*).


2. Prove that there is a one-to-one function G mapping X to R.


We can use a proof by contradiction to show that these two facts
together imply Theorem 2.5. Specifically, if we assume (for the
sake of contradiction) that there exists some one-to-one F mapping R
to {0,1}*, then the function x ↦ F(G(x)) obtained by composing F
with the function G from Step 2 above would be a one-to-one function
from X to {0,1}*, which contradicts what we proved in Step 1!

To turn this idea into a full proof of Theorem 2.5 we need to: 


• Define the set X.

• Prove that there is no one-to-one function from X to {0,1}*.

• Prove that there is a one-to-one function from X to R.


We now proceed to do precisely that. That is, we will define the set
{0,1}^∞, which will play the role of X, and then state and prove two
lemmas that show that this set satisfies our two desired properties.


Definition 2.7 We denote by {0,1}^∞ the set {f | f : N → {0,1}}.


That is, {0,1}^∞ is a set of functions, and a function f is in {0,1}^∞
iff its domain is N and its codomain is {0,1}. We can also think of
{0,1}^∞ as the set of all infinite sequences of bits, since a function
f : N → {0,1} can be identified with the sequence (f(0), f(1), f(2), ...).
The following two lemmas show that {0,1}^∞ can play the role of X to
establish Theorem 2.5.


Lemma 2.8 There does not exist a one-to-one map FtS : {0,1}^∞ →
{0,1}*.³


Lemma 2.9 There does exist a one-to-one map FtR : {0,1}^∞ → R.⁴


As we've seen above, Lemma 2.8 and Lemma 2.9 together imply
Theorem 2.5. To repeat the argument more formally, suppose, for
the sake of contradiction, that there did exist a one-to-one function
RtS : R → {0,1}*. By Lemma 2.9, there exists a one-to-one function
FtR : {0,1}^∞ → R. Thus, under this assumption, since the composition
of two one-to-one functions is one-to-one (see Exercise 2.12), the
function FtS : {0,1}^∞ → {0,1}* defined as FtS(f) = RtS(FtR(f))
will be one to one, contradicting Lemma 2.8. See Fig. 2.7 for a graphical
illustration of this argument.


³ FtS stands for “functions to strings”.

⁴ FtR stands for “functions to reals.”


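[Diagram for Figure 2.7: three sets, namely {0,1}^∞ = {f | f : N → {0,1}}
(infinite sequences of bits), R = (−∞, +∞) (real numbers), and {0,1}*
(strings, i.e., finite sequences of bits), connected by the one-to-one map
FtR that exists by Lemma 2.9, the hypothetical one-to-one map RtS, and
the one-to-one map FtS ruled out by Lemma 2.8.]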


Now all that is left is to prove these two lemmas. We start by prov- 
ing Lemma 2.8 which is really the heart of Theorem 2.5. 


[Diagram for Figure 2.8: a table for a hypothetical onto function
StF : {0,1}* → {0,1}^∞. The rows are indexed by the inputs x ∈ {0,1}*,
listed in the order n(x) = 0, 1, 2, ..., and the columns by the output
bits StF(x)(0), StF(x)(1), StF(x)(2), .... We set d(n) = 1 − (the n-th
diagonal entry). In the pictured example d = (1,0,0,1,1,1,...) ∈ {0,1}^∞
is different from every row of the table. Hence d is not in the image
of StF!]


Figure 2.7: We prove Theorem 2.5 by combining Lemma 2.8 and Lemma 2.9.
Lemma 2.9, which uses standard calculus tools, shows the existence of a
one-to-one map FtR from the set $\{0,1\}^\infty$ to the real numbers. So,
if a hypothetical one-to-one map $RtS : \mathbb{R} \rightarrow \{0,1\}^*$
existed, then we could compose them to get a one-to-one map
$FtS : \{0,1\}^\infty \rightarrow \{0,1\}^*$. Yet this contradicts
Lemma 2.8, the heart of the proof, which rules out the existence of such
a map.


Figure 2.8: We construct a function $d$ such that $d \neq StF(x)$ for
every $x \in \{0,1\}^*$ by ensuring that $d(n(x)) \neq StF(x)(n(x))$ for
every $x \in \{0,1\}^*$ with lexicographic order $n(x)$. We can think of
this as building a table where the columns correspond to numbers
$m \in \mathbb{N}$ and the rows correspond to $x \in \{0,1\}^*$ (sorted
according to $n(x)$). If the entry in the $x$-th row and the $m$-th
column is $g(m)$ where $g = StF(x)$, then $d$ is obtained by going over
the “diagonal” elements in this table (the entries in the $x$-th row and
$n(x)$-th column) and ensuring that $d(n(x)) \neq StF(x)(n(x))$.


Warm-up: “Baby Cantor”. The proof of Lemma 2.8 is rather subtle. One
way to get intuition for it is to consider the following finite
statement: “there is no onto function
$f : \{0,\ldots,99\} \rightarrow \{0,1\}^{100}$”. Of course we know it’s
true since the set $\{0,1\}^{100}$ is bigger than the set $[100]$, but
let’s see a direct proof. For every
$f : \{0,\ldots,99\} \rightarrow \{0,1\}^{100}$, we can define the string
$d \in \{0,1\}^{100}$ as follows:
$d = (1 - f(0)_0, 1 - f(1)_1, \ldots, 1 - f(99)_{99})$. If $f$ was onto,
then there would exist some $n \in [100]$ such that $f(n) = d$, but we
claim that no such $n$ exists. Indeed, if there was such $n$, then the
$n$-th coordinate of $d$ would equal $f(n)_n$, but by definition this
coordinate equals $1 - f(n)_n$. See also a “proof by code” of this
statement.
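The finite statement can be checked mechanically. The following is only a minimal sketch: the table `table` is a randomly chosen stand-in for an arbitrary $f$, and the names `baby_cantor_diagonal`, `table`, and `missed` are made up for this illustration.

```python
import random

def baby_cantor_diagonal(f, size=100):
    """Return the string d with d[n] = 1 - f(n)[n] for every n < size."""
    return [1 - f(n)[n] for n in range(size)]

# A hypothetical f: {0,...,99} -> {0,1}^100, chosen at random for illustration.
random.seed(0)
table = [[random.randint(0, 1) for _ in range(100)] for _ in range(100)]
f = lambda n: table[n]

d = baby_cantor_diagonal(f)

# d differs from f(n) in coordinate n, so no input n is mapped to d.
missed = all(f(n) != d for n in range(100))
```

Whatever table is plugged in, `missed` comes out true, because the check only uses the fact that `d[n]` was defined as the opposite of `f(n)[n]`.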


Proof of Lemma 2.8. We will prove that there does not exist an onto
function $StF : \{0,1\}^* \rightarrow \{0,1\}^\infty$. This implies the
lemma since for every two sets $A$ and $B$, there exists an onto function
from $A$ to $B$ if and only if there exists a one-to-one function from
$B$ to $A$ (see Lemma 1.2).

The technique of this proof is known as the “diagonal argument”
and is illustrated in Fig. 2.8. We assume, towards a contradiction, that
there exists such a function $StF : \{0,1\}^* \rightarrow \{0,1\}^\infty$.
We will show that StF is not onto by demonstrating a function
$d \in \{0,1\}^\infty$ such that $d \neq StF(x)$ for every
$x \in \{0,1\}^*$. Consider the lexicographic ordering of binary strings
(i.e., "", 0, 1, 00, 01, ...). For every $n \in \mathbb{N}$, we let $x_n$
be the $n$-th string in this order. That is, $x_0 =$ "", $x_1 = 0$,
$x_2 = 1$, and so on and so forth. We define the function
$d \in \{0,1\}^\infty$ as follows:


$$d(n) = 1 - StF(x_n)(n)$$


for every $n \in \mathbb{N}$. That is, to compute $d$ on input
$n \in \mathbb{N}$, we first compute $g = StF(x_n)$, where
$x_n \in \{0,1\}^*$ is the $n$-th string in the lexicographical ordering.
Since $g \in \{0,1\}^\infty$, it is a function mapping $\mathbb{N}$ to
$\{0,1\}$. The value $d(n)$ is defined to be the negation of $g(n)$.

The definition of the function $d$ is a bit subtle. One way to think
about it is to imagine the function StF as being specified by an
infinitely long table, in which every row corresponds to a string
$x \in \{0,1\}^*$ (with strings sorted in lexicographic order), and
contains the sequence $StF(x)(0), StF(x)(1), StF(x)(2), \ldots$. The
diagonal elements in this table are the values


$$StF("")(0),\; StF(0)(1),\; StF(1)(2),\; StF(00)(3),\; StF(01)(4),\; \ldots$$


which correspond to the elements $StF(x_n)(n)$ in the $n$-th row and
$n$-th column of this table for $n = 0, 1, 2, \ldots$. The function $d$
we defined above maps every $n \in \mathbb{N}$ to the negation of the
$n$-th diagonal value.

To complete the proof that StF is not onto we need to show that
$d \neq StF(x)$ for every $x \in \{0,1\}^*$. Indeed, let
$x \in \{0,1\}^*$ be some string and let $g = StF(x)$. If $n$ is the
position of $x$ in the lexicographical order then by construction
$d(n) = 1 - g(n) \neq g(n)$, which means that $g \neq d$, which is what
we wanted to prove.

■
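The construction of $d$ can be written out as code. This is a sketch under stated assumptions: an infinite sequence is modeled as a Python function from `int` to `{0, 1}`, and `example_StF` is an arbitrary made-up candidate, not a function from the text.

```python
from itertools import count, islice, product

def strings_in_order():
    """Yield "", "0", "1", "00", "01", ... (shorter strings first)."""
    for length in count(0):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def nth_string(n):
    """Return x_n, the n-th string in the ordering above."""
    return next(islice(strings_in_order(), n, None))

def diagonal(StF):
    """Given StF mapping a string to a function N -> {0,1},
    return d with d(n) = 1 - StF(x_n)(n)."""
    return lambda n: 1 - StF(nth_string(n))(n)

# Hypothetical candidate, for illustration only: the m-th bit of StF(x)
# is the parity of m plus the number of ones in x.
def example_StF(x):
    return lambda m: (m + x.count("1")) % 2

d = diagonal(example_StF)
first_rows = list(islice(strings_in_order(), 20))
disagreements = [d(n) != example_StF(x)(n) for n, x in enumerate(first_rows)]
```

Every entry of `disagreements` is true: row $n$ of the table and $d$ differ in column $n$, whichever candidate is passed to `diagonal`.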


To complete the proof of Theorem 2.5, we need to show Lemma 2.9. 
This requires some calculus background but is otherwise straightfor- 
ward. If you have not had much experience with limits of a real series 
before, then the formal proof below might be a little hard to follow. 
This part is not the core of Cantor’s argument, nor are such limits 
important to the remainder of this book, so you can feel free to take 
Lemma 2.9 on faith and skip the proof. 


Proof Idea: 

We define $FtR(f)$ to be the number between 0 and 2 whose decimal
expansion is $f(0).f(1)f(2)\ldots$, or in other words
$FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i}$. If $f$ and $g$ are two
distinct functions in $\{0,1\}^\infty$, then there must be some input $k$
on which they disagree. If we take the minimum such $k$, then the numbers
$f(0).f(1)f(2)\ldots f(k-1)f(k)\ldots$ and
$g(0).g(1)g(2)\ldots g(k)\ldots$ agree with each other all the way up to
the $(k-1)$-th digit after the decimal point, and disagree on the $k$-th
digit. But then these numbers must be distinct. Concretely, if
$f(k) = 1$ and $g(k) = 0$ then the first number is larger than the
second, and otherwise ($f(k) = 0$ and $g(k) = 1$) the first number is
smaller than the second. In the proof we have to be a little careful
since these are numbers with infinite expansions. For example, the number
one half has two decimal expansions $0.5$ and $0.49999\cdots$. However,
this issue does not come up here, since we restrict attention only to
numbers with decimal expansions that do not involve the digit 9.

* 


Proof of Lemma 2.9. For every $f \in \{0,1\}^\infty$, we define $FtR(f)$
to be the number whose decimal expansion is $f(0).f(1)f(2)f(3)\ldots$.
Formally,


$$FtR(f) = \sum_{i=0}^{\infty} f(i) \cdot 10^{-i} \qquad (2.2)$$


It is a known result in calculus (whose proof we will not repeat here)
that the series on the right-hand side of (2.2) converges to a definite
limit in $\mathbb{R}$.

We now prove that FtR is one-to-one. Let $f, g$ be two distinct
functions in $\{0,1\}^\infty$. Since $f$ and $g$ are distinct, there must
be some input on which they differ, and we define $k$ to be the smallest
such input and assume without loss of generality that $f(k) = 0$ and
$g(k) = 1$. (Otherwise, if $f(k) = 1$ and $g(k) = 0$, then we can simply
switch the roles of $f$ and $g$.) The numbers $FtR(f)$ and $FtR(g)$ agree
with each other up to the $(k-1)$-th digit after the decimal point. Since
the $k$-th digit equals 0 for $FtR(f)$ and equals 1 for $FtR(g)$, we
claim that $FtR(g)$ is bigger than $FtR(f)$ by at least
$0.5 \cdot 10^{-k}$. To see this note that the difference
$FtR(g) - FtR(f)$ will be minimized if $g(\ell) = 0$ for every
$\ell > k$ and $f(\ell) = 1$ for every $\ell > k$, in which case (since
$f$ and $g$ agree up to the $(k-1)$-th digit)


$$FtR(g) - FtR(f) = 10^{-k} - 10^{-k-1} - 10^{-k-2} - 10^{-k-3} - \cdots \qquad (2.3)$$


Since the infinite series $\sum_{j=0}^{\infty} 10^{-j}$ converges to
$10/9$, it follows that for every such $f$ and $g$,
$FtR(g) - FtR(f) \geq 10^{-k} - 10^{-k-1} \cdot (10/9) = \tfrac{8}{9} \cdot 10^{-k} > 0.5 \cdot 10^{-k} > 0$.
In particular we see that for every distinct $f, g \in \{0,1\}^\infty$,
$FtR(f) \neq FtR(g)$, implying that the function FtR is one-to-one.
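The bound can be sanity-checked with exact rational arithmetic. The sketch below works with finite prefixes only, so it illustrates the worst case in (2.3) rather than proving anything about the infinite series; the helper name `FtR_prefix` and the chosen values of `k` and `m` are assumptions of this example.

```python
from fractions import Fraction

def FtR_prefix(bits):
    """Exact value of sum_i bits[i] * 10^(-i) for a finite prefix of f."""
    return sum(Fraction(b, 10**i) for i, b in enumerate(bits))

k, m = 3, 40                                # first disagreement at index k; keep m digits
common = [1, 0, 1]                          # f and g agree on indices 0..k-1
f_bits = common + [0] + [1] * (m - k - 1)   # f(k) = 0, then all ones
g_bits = common + [1] + [0] * (m - k - 1)   # g(k) = 1, then all zeros

gap = FtR_prefix(g_bits) - FtR_prefix(f_bits)
lower_bound = Fraction(8, 9) * Fraction(1, 10**k)
```

Truncating the expansions only drops subtracted terms, so `gap` stays above `lower_bound`, which is the limiting value $\tfrac{8}{9} \cdot 10^{-k}$ of the right-hand side of (2.3).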


2.4.1 Corollary: Boolean functions are uncountable 


Cantor’s Theorem yields the following corollary that we will use 
several times in this book: the set of all Boolean functions (mapping 
{0, 1}* to {0,1}) is not countable: 


Theorem 2.12 — Boolean functions are uncountable. Let ALL be the set of
all functions $F : \{0,1\}^* \rightarrow \{0,1\}$. Then ALL is
uncountable. Equivalently, there does not exist an onto map
$StALL : \{0,1\}^* \rightarrow ALL$.


Proof Idea: 

This is a direct consequence of Lemma 2.8, since we can use the
binary representation to show a one-to-one map from $\{0,1\}^\infty$ to
ALL. Hence the uncountability of $\{0,1\}^\infty$ implies the
uncountability of ALL.

* 


Proof of Theorem 2.12. Since $\{0,1\}^\infty$ is uncountable, the result
will follow by showing a one-to-one map from $\{0,1\}^\infty$ to ALL. The
reason is that the existence of such a map implies that if ALL was
countable, and hence there was a one-to-one map from ALL to $\mathbb{N}$,
then there would have been a one-to-one map from $\{0,1\}^\infty$ to
$\mathbb{N}$, contradicting Lemma 2.8.

We now show this one-to-one map. We simply map a function
$f \in \{0,1\}^\infty$ to the function $F : \{0,1\}^* \rightarrow \{0,1\}$
as follows. We let $F(0) = f(0)$, $F(1) = f(1)$, $F(10) = f(2)$,
$F(11) = f(3)$ and so on and so forth. That is, for every
$x \in \{0,1\}^*$ that represents a natural number $n$ in the binary
basis, we define $F(x) = f(n)$. If $x$ does not represent such a number
(e.g., it has a leading zero), then we set $F(x) = 0$.

This map is one-to-one since if $f \neq g$ are two distinct elements in
$\{0,1\}^\infty$, then there must be some input $n \in \mathbb{N}$ on
which $f(n) \neq g(n)$. But then if $x \in \{0,1\}^*$ is the string
representing $n$, we see that $F(x) \neq G(x)$ where $F$ is the function
in ALL that $f$ is mapped to, and $G$ is the function that $g$ is mapped
to.
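As an illustration of the map from $f$ to $F$, here is a small sketch. It assumes sequences are modeled as Python functions on integers; `to_boolean_function` and the two sample sequences are hypothetical names chosen for this example.

```python
def to_boolean_function(f):
    """Map f: N -> {0,1} to F: {0,1}* -> {0,1} as in the proof above."""
    def F(x):
        # x is a canonical binary representation if it is "0" or a
        # non-empty string of bits starting with "1".
        is_canonical = x == "0" or (
            x != "" and x[0] == "1" and set(x) <= {"0", "1"}
        )
        return f(int(x, 2)) if is_canonical else 0
    return F

f = lambda n: n % 2                       # the sequence 0, 1, 0, 1, ...
g = lambda n: 0 if n == 5 else n % 2      # differs from f only at n = 5

F = to_boolean_function(f)
G = to_boolean_function(g)
```

Since $f(5) \neq g(5)$ and the string 101 represents 5, the functions `F` and `G` differ on the input `"101"`, mirroring the one-to-one argument.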

2.4.2 Equivalent conditions for countability

The results above establish many equivalent ways to phrase the fact 
that a set is countable. Specifically, the following statements are all 
equivalent: 


1. The set S is countable 

2. There exists an onto map from N to S 

3. There exists an onto map from {0, 1}* to S. 

4. There exists a one-to-one map from S to N 

5. There exists a one-to-one map from S to {0, 1}*. 


6. There exists an onto map from some countable set T to S. 


7. There exists a one-to-one map from S to some countable set T. 


2.5 REPRESENTING OBJECTS BEYOND NUMBERS 


Numbers are of course by no means the only objects that we can repre- 
sent as binary strings. A representation scheme for representing objects 
from some set O consists of an encoding function that maps an object in 
O to a string, and a decoding function that decodes a string back to an 
object in O. Formally, we make the following definition: 


Definition 2.13 — String representation. Let $O$ be any set. A
representation scheme for $O$ is a pair of functions $E, D$ where
$E : O \rightarrow \{0,1\}^*$ is a total one-to-one function,
$D : \{0,1\}^* \rightarrow_p O$ is a (possibly partial) function, and
such that $D$ and $E$ satisfy that $D(E(o)) = o$ for every $o \in O$.
$E$ is known as the encoding function and $D$ is known as the decoding
function.


Note that the condition $D(E(o)) = o$ for every $o \in O$ implies
that D is onto (can you see why?). It turns out that to construct a 
representation scheme we only need to find an encoding function. That 
is, every one-to-one encoding function has a corresponding decoding 
function, as shown in the following lemma: 


Lemma 2.14 Suppose that $E : O \rightarrow \{0,1\}^*$ is one-to-one. Then
there exists a function $D : \{0,1\}^* \rightarrow O$ such that
$D(E(o)) = o$ for every $o \in O$.


Proof. Let $o_0$ be some arbitrary element of $O$. For every
$x \in \{0,1\}^*$, there exists either zero or a single $o \in O$ such
that $E(o) = x$ (otherwise $E$ would not be one-to-one). We will define
$D(x)$ to equal $o_0$ in the first case and this single object $o$ in the
second case. By definition $D(E(o)) = o$ for every $o \in O$.
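For a finite set of objects, the construction in this proof can be carried out by tabulating the encoder. The sketch below is illustrative only: `make_decoder` is a made-up helper, and the sample encoder (binary representation of the numbers 0 to 9) is an assumption of the example.

```python
def make_decoder(objects, E, default):
    """Build D with D(E(o)) == o for every o in `objects`.

    `default` plays the role of the arbitrary element o_0 returned on
    strings that are not the encoding of any object.
    """
    table = {}
    for o in objects:
        code = E(o)
        if code in table:
            raise ValueError("E is not one-to-one")
        table[code] = o
    return lambda x: table.get(x, default)

# Hypothetical example: encode the numbers 0..9 in binary.
objects = range(10)
E = lambda n: format(n, "b")
D = make_decoder(objects, E, default=0)
```

The dictionary lookup is exactly the case split in the proof: a string with a preimage is decoded to it, and every other string is sent to the default object.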


2.5.1 Finite representations 


If $O$ is finite, then we can represent every object in $O$ as a string
of length at most some number $n$. What is the value of $n$? Let us
denote by $\{0,1\}^{\leq n}$ the set
$\{ x \in \{0,1\}^* : |x| \leq n \}$ of strings of length at most $n$.
The size of $\{0,1\}^{\leq n}$ is equal to


$$|\{0,1\}^0| + |\{0,1\}^1| + |\{0,1\}^2| + \cdots + |\{0,1\}^n| = \sum_{i=0}^{n} 2^i = 2^{n+1} - 1$$

using the standard formula for summing a geometric progression.
To obtain a representation of objects in $O$ as strings of length at
most $n$ we need to come up with a one-to-one function from $O$ to
$\{0,1\}^{\leq n}$. We can do so if and only if $|O| \leq 2^{n+1} - 1$,
as is implied by the following lemma:


Lemma 2.16 For every two non-empty finite sets $S, T$, there exists a
one-to-one $E : S \rightarrow T$ if and only if $|S| \leq |T|$.


Proof. Let $k = |S|$ and $m = |T|$ and so write the elements of $S$ and
$T$ as $S = \{s_0, s_1, \ldots, s_{k-1}\}$ and
$T = \{t_0, t_1, \ldots, t_{m-1}\}$. We need to show that there is a
one-to-one function $E : S \rightarrow T$ iff $k \leq m$. For the “if”
direction, if $k \leq m$ we can simply define $E(s_i) = t_i$ for every
$i \in [k]$. Clearly for $i \neq j$, $t_i = E(s_i) \neq E(s_j) = t_j$,
and hence this function is one-to-one. In the other direction, suppose
that $k > m$ and $E : S \rightarrow T$ is some function. Then $E$ cannot
be one-to-one. Indeed, for $i = 0, 1, \ldots, m-1$ let us “mark” the
element $t_j = E(s_i)$ in $T$. If $t_j$ was marked before, then we have
found two objects in $S$ mapping to the same element $t_j$. Otherwise,
since $T$ has $m$ elements, when we get to $i = m-1$ we mark all the
objects in $T$. Hence, in this case, $E(s_m)$ must map to an element that
was already marked before. (This observation is sometimes known as the
“Pigeonhole Principle”: the principle that if you have a pigeon coop with
$m$ holes and $k > m$ pigeons, then there must be two pigeons in the same
hole.)

■
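The counting formula and the resulting bound on $n$ are easy to check by brute force. This is a small sketch; the function names `count_strings_up_to` and `min_length` are made up for the illustration.

```python
from itertools import product

def count_strings_up_to(n):
    """Number of binary strings of length at most n, counted by enumeration."""
    return sum(1 for length in range(n + 1)
                 for _ in product("01", repeat=length))

def min_length(num_objects):
    """Smallest n such that num_objects <= 2^(n+1) - 1."""
    n = 0
    while 2 ** (n + 1) - 1 < num_objects:
        n += 1
    return n
```

For instance, three objects fit into strings of length at most 1 (namely "", 0 and 1), while a fourth object forces length 2, as the pigeonhole argument predicts.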


2.5.2 Prefix-free encoding 

When showing a representation scheme for rational numbers, we used 
the “hack” of encoding the alphabet $\{0, 1, \|\}$ to represent tuples of
strings as a single string. This is a special case of the general paradigm 
of prefix-free encoding. The idea is the following: if our representation 
has the property that no string x representing an object o is a prefix 
(i.e., an initial substring) of a string y representing a different object 
o’, then we can represent a list of objects by merely concatenating the 
representations of all the list members. For example, because in En- 
glish every sentence ends with a punctuation mark such as a period, 
exclamation, or question mark, no sentence can be a prefix of another 
and so we can represent a list of sentences by merely concatenating 
the sentences one after the other. (English has some complications 
such as periods used for abbreviations (e.g., “e.g.”) or sentence quotes 
containing punctuation, but the high level point of a prefix-free repre- 
sentation for sentences still holds.) 

It turns out that we can transform every representation to a prefix- 
free form. This justifies Big Idea 1, and allows us to transform a repre- 
sentation scheme for objects of a type T to a representation scheme of 
lists of objects of the type T. By repeating the same technique, we can 
also represent lists of lists of objects of type T, and so on and so forth. 
But first let us formally define prefix-freeness: 


Definition 2.17 — Prefix free encoding. For two strings $y, y'$, we say
that $y$ is a prefix of $y'$ if $|y| \leq |y'|$ and for every $i < |y|$,
$y'_i = y_i$.

Let $O$ be a non-empty set and $E : O \rightarrow \{0,1\}^*$ be a
function. We say that $E$ is prefix-free if $E(o)$ is non-empty for every
$o \in O$ and there does not exist a distinct pair of objects
$o, o' \in O$ such that $E(o)$ is a prefix of $E(o')$.
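The definition translates directly into a check on a finite list of encodings. A minimal sketch, in which the list `codes` is assumed to hold the encodings of pairwise distinct objects and both function names are made up:

```python
def is_prefix(y, y_prime):
    """True if y is a prefix of y_prime, following the definition literally."""
    return len(y) <= len(y_prime) and all(
        y_prime[i] == y[i] for i in range(len(y))
    )

def is_prefix_free(codes):
    """Check prefix-freeness of a list of encodings of distinct objects."""
    codes = list(codes)
    if any(code == "" for code in codes):
        return False
    return not any(
        is_prefix(codes[a], codes[b])
        for a in range(len(codes))
        for b in range(len(codes))
        if a != b
    )
```

Note that a list containing the same string twice is rejected, since a string is a prefix of itself: a prefix-free encoding is in particular one-to-one.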


Recall that for every set $O$, the set $O^*$ consists of all finite
length tuples (i.e., lists) of elements in $O$. The following theorem
shows that if $E$ is a prefix-free encoding of $O$ then by concatenating
encodings we can obtain a valid (i.e., one-to-one) representation of
$O^*$:


Theorem 2.18 — Prefix-free implies tuple encoding. Suppose that
$E : O \rightarrow \{0,1\}^*$ is prefix-free. Then the following map
$\overline{E} : O^* \rightarrow \{0,1\}^*$ is one-to-one: for every
$(o_0, \ldots, o_{k-1}) \in O^*$, we define


$$\overline{E}(o_0, \ldots, o_{k-1}) = E(o_0) E(o_1) \cdots E(o_{k-1}) .$$


Figure 2.9: If we have a prefix-free representation of each object then
we can concatenate the representations of $k$ objects to obtain a
representation for the tuple $(o_0, \ldots, o_{k-1})$.


Proof Idea:

The idea behind the proof is simple. Suppose that for example we want to
decode a triple $(o_0, o_1, o_2)$ from its representation
$x = \overline{E}(o_0, o_1, o_2) = E(o_0)E(o_1)E(o_2)$. We will do so by
first finding the first prefix $x_0$ of $x$ that is a representation of
some object. Then we will decode this object, remove $x_0$ from $x$ to
obtain a new string $x'$, and continue onwards to find the first prefix
$x_1$ of $x'$ and so on and so forth (see Exercise 2.9). The
prefix-freeness property of $E$ will ensure that $x_0$ will in fact be
$E(o_0)$, $x_1$ will be $E(o_1)$, etc.

* 


Proof of Theorem 2.18. We now show the formal proof. Suppose, for the
sake of contradiction, that there exist two distinct tuples
$(o_0, \ldots, o_{k-1})$ and $(o'_0, \ldots, o'_{k'-1})$ such that


$$\overline{E}(o_0, \ldots, o_{k-1}) = \overline{E}(o'_0, \ldots, o'_{k'-1}) . \qquad (2.4)$$


We will denote the string $\overline{E}(o_0, \ldots, o_{k-1})$ by
$\overline{x}$.

Let $i$ be the first index such that $o_i \neq o'_i$. (If $o_i = o'_i$
for all $i$ then, since we assume the two tuples are distinct, one of
them must be longer than the other. In this case we assume without loss
of generality that $k' > k$ and let $i = k$.) In the case that $i < k$,
we see that the string $\overline{x}$ can be written in two different
ways:


$$\overline{x} = \overline{E}(o_0, \ldots, o_{k-1}) = x_0 \cdots x_{i-1} E(o_i) E(o_{i+1}) \cdots E(o_{k-1})$$

and

$$\overline{x} = \overline{E}(o'_0, \ldots, o'_{k'-1}) = x_0 \cdots x_{i-1} E(o'_i) E(o'_{i+1}) \cdots E(o'_{k'-1})$$


where $x_j = E(o_j) = E(o'_j)$ for all $j < i$. Let $\overline{y}$ be the
string obtained after removing the prefix $x_0 \cdots x_{i-1}$ from
$\overline{x}$. We see that $\overline{y}$ can be written as both
$\overline{y} = E(o_i)s$ for some string $s \in \{0,1\}^*$ and as
$\overline{y} = E(o'_i)s'$ for some $s' \in \{0,1\}^*$. But this means
that one of $E(o_i)$ and $E(o'_i)$ must be a prefix of the other,
contradicting the prefix-freeness of $E$.

In the case that $i = k$ and $k' > k$, we get a contradiction in the
following way. In this case


$$\overline{x} = E(o_0) \cdots E(o_{k-1}) = E(o_0) \cdots E(o_{k-1}) E(o'_k) \cdots E(o'_{k'-1})$$


which means that $E(o'_k) \cdots E(o'_{k'-1})$ must correspond to the
empty string "". But in such a case $E(o'_k)$ must be the empty string,
which in particular is the prefix of any other string, contradicting the
prefix-freeness of $E$.


2.5.3 Making representations prefix-free 

Some natural representations are prefix-free. For example, every fixed 
output length representation (i.e., one-to-one function $E : O \rightarrow \{0,1\}^n$)
is automatically prefix-free, since a string x can only be a prefix of 

an equal-length x’ if x and x’ are identical. Moreover, the approach 
we used for representing rational numbers can be used to show the 
following: 


Lemma 2.20 Let $E : O \rightarrow \{0,1\}^*$ be a one-to-one function.
Then there is a one-to-one prefix-free encoding $\overline{E}$ such that
$|\overline{E}(o)| \leq 2|E(o)| + 2$ for every $o \in O$.


Proof of Lemma 2.20. The idea behind the proof is to use the map
$0 \mapsto 00$, $1 \mapsto 11$ to “double” every bit in the string $x$
and then mark the end of the string by concatenating to it the pair $01$.
If we encode a string $x$ in this way, it ensures that the encoding of
$x$ is never a prefix of the encoding of a distinct string $x'$.
Formally, we define the function $PF : \{0,1\}^* \rightarrow \{0,1\}^*$
as follows:


$$PF(x) = x_0 x_0 x_1 x_1 \ldots x_{n-1} x_{n-1} 01$$


for every $x \in \{0,1\}^*$, where $n = |x|$. If
$E : O \rightarrow \{0,1\}^*$ is the (potentially not prefix-free)
representation for $O$, then we transform it into a prefix-free
representation $\overline{E} : O \rightarrow \{0,1\}^*$ by defining
$\overline{E}(o) = PF(E(o))$.

To prove the lemma we need to show that (1) $\overline{E}$ is one-to-one
and (2) $\overline{E}$ is prefix-free. In fact, prefix freeness is a
stronger condition than one-to-one (if two strings are equal then in
particular one of them is a prefix of the other) and hence it suffices to
prove (2), which we now do.

Let $o \neq o'$ in $O$ be two distinct objects. We will prove that
$\overline{E}(o)$ is not a prefix of $\overline{E}(o')$, or in other
words $PF(x)$ is not a prefix of $PF(x')$ where $x = E(o)$ and
$x' = E(o')$. Since $E$ is one-to-one, $x \neq x'$. We will split into
three cases, depending on whether $|x| < |x'|$, $|x| = |x'|$, or
$|x| > |x'|$. If $|x| < |x'|$ then the two bits in positions
$2|x|, 2|x|+1$ in $PF(x)$ have the value $01$ but the corresponding bits
in $PF(x')$ will equal either $00$ or $11$ (depending on the $|x|$-th bit
of $x'$) and hence $PF(x)$ cannot be a prefix of $PF(x')$. If
$|x| = |x'|$ then, since $x \neq x'$, there must be a coordinate $i$ in
which they differ, meaning that the strings $PF(x)$ and $PF(x')$ differ
in the coordinates $2i, 2i+1$, which again means that $PF(x)$ cannot be a
prefix of $PF(x')$. If $|x| > |x'|$ then
$|PF(x)| = 2|x| + 2 > |PF(x')| = 2|x'| + 2$ and hence $PF(x)$ is longer
than (and cannot be a prefix of) $PF(x')$. In all cases we see that
$PF(x) = \overline{E}(o)$ is not a prefix of $PF(x') = \overline{E}(o')$,
hence completing the proof.

■


The proof of Lemma 2.20 is not the only or even the best way to
transform an arbitrary representation into prefix-free form.
Exercise 2.10 asks you to construct a more efficient prefix-free
transformation satisfying
$|\overline{E}(o)| \leq |E(o)| + O(\log |E(o)|)$.


2.5.4 “Proof by Python” (optional) 
The proofs of Theorem 2.18 and Lemma 2.20 are constructive in the 
sense that they give us: 


• A way to transform the encoding and decoding functions of any
representation of an object O to encoding and decoding functions
that are prefix-free, and


• A way to extend prefix-free encoding and decoding of single objects
to encoding and decoding of lists of objects by concatenation.

Specifically, we could transform any pair of Python functions encode
and decode to functions pfencode and pfdecode that correspond
to a prefix-free encoding and decoding. Similarly, given pfencode and
pfdecode for single objects, we can extend them to encoding of lists.
Let us show how this works for the case of the NtS and StN functions
we defined above.

We start with the “Python proof” of Lemma 2.20: a way to trans- 
form an arbitrary representation into one that is prefix free. The func- 
tion prefixfree below takes as input a pair of encoding and decoding 
functions, and returns a triple of functions containing prefix-free encod- 
ing and decoding functions, as well as a function that checks whether 
a string is a valid encoding of an object. 


    # Takes functions encode and decode mapping objects to strings of
    # bits and vice versa, and returns functions pfencode and pfdecode
    # that map objects to strings of bits and vice versa in a
    # prefix-free way. Also returns a function pfvalid that says
    # whether a string is a valid encoding.
    # (Assumes, as the outputs below suggest, that encode returns a
    # Python str made of the characters '0' and '1'.)
    def prefixfree(encode, decode):
        def pfencode(o):
            L = encode(o)
            # double every bit, then append the end marker 01
            return "".join(L[i//2] for i in range(2*len(L))) + "01"
        def pfdecode(L):
            # drop the end marker and keep one bit of every doubled pair
            return decode("".join(L[j] for j in range(0, len(L)-2, 2)))
        def pfvalid(L):
            return (len(L) % 2 == 0
                    and all(L[2*i] == L[2*i+1] for i in range((len(L)-2)//2))
                    and L[-2:] == "01")

        return pfencode, pfdecode, pfvalid


    pfNtS, pfStN, pfvalidN = prefixfree(NtS, StN)


    NtS(234)
    # 11101010
    pfNtS(234)
    # 111111001100110001
    pfStN(pfNtS(234))
    # 234
    pfvalidN(pfNtS(234))
    # True


We now show a “Python proof” of Theorem 2.18. Namely, we show
a function represlists that takes as input a prefix-free representation
scheme (implemented via encoding, decoding, and validity testing
functions) and outputs a representation scheme for lists of such
objects. If we want to make this representation prefix-free then we could
fit it into the function prefixfree above.


    def represlists(pfencode, pfdecode, pfvalid):
        """
        Takes functions pfencode, pfdecode and pfvalid,
        and returns functions encodelist, decodelist
        that can encode and decode lists of the objects
        respectively.
        """

        def encodelist(L):
            """Gets list of objects, encodes it as a string of bits"""
            return "".join([pfencode(obj) for obj in L])

        def decodelist(S):
            """Gets a string of bits, returns list of objects"""
            i=0; j=1 ; res = []
            while j<=len(S):
                # S[i:j] is the shortest not-yet-decoded piece ending at j
                if pfvalid(S[i:j]):
                    res += [pfdecode(S[i:j])]
                    i=j
                j+= 1
            return res

        return encodelist,decodelist


    LtS , StL = represlists(pfNtS,pfStN,pfvalidN)

    LtS([234,12,5])
    # 111111001100110001111100000111001101
    StL(LtS([234,12,5]))
    # [234, 12, 5]

2.5.5 Representing letters and text 

We can represent a letter or symbol by a string, and then if this rep- 
resentation is prefix-free, we can represent a sequence of symbols by 
merely concatenating the representation of each symbol. One such 
representation is the ASCII that represents 128 letters and symbols 
as strings of 7 bits. Since the ASCII representation is fixed-length, it 
is automatically prefix-free (can you see why?). Unicode is the
representation of (at the time of this writing) about 128,000 symbols as
numbers (known as code points) between 0 and 1,114,111. There are
several types of prefix-free representations of the code points, a
popular one being UTF-8 that encodes every code point into a string of
length between 8 and 32 bits.
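Python's built-in string type exposes both code points and UTF-8 encodings, so these numbers can be observed directly. A short sketch; the four sample characters are an arbitrary choice for illustration.

```python
# Sample characters: "A", e-acute, the euro sign, and an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# Length in bits of the UTF-8 encoding of each character.
bit_lengths = [8 * len(ch.encode("utf-8")) for ch in samples]

# The code point of each character, and the largest valid code point.
code_points = [ord(ch) for ch in samples]
largest = chr(0x10FFFF)

# The 7-bit ASCII representation of "A".
ascii_A = format(ord("A"), "07b")

# Because UTF-8 is prefix-free, a concatenation of encodings decodes uniquely.
round_trip = "".join(samples).encode("utf-8").decode("utf-8")
```

Here `bit_lengths` covers all four possible UTF-8 lengths (8, 16, 24 and 32 bits), and decoding the concatenated bytes recovers the original sequence of symbols.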


Example 2.21 — The Braille representation. The Braille system is another
way to encode letters and other symbols as binary strings. Specifically,
in Braille, every letter is encoded as a string in $\{0,1\}^6$, which
is written using indented dots arranged in two columns and three
rows, see Fig. 2.10. (Some symbols require more than one six-bit
string to encode, and so Braille uses a more general prefix-free
encoding.)

The Braille system was invented in 1821 by Louis Braille when
he was just 12 years old (though he continued working on it and
improving it throughout his life). Braille was a French boy who
lost his eyesight at the age of 5 as the result of an accident.


Figure 2.10: The word “Binary” in “Grade 1” or “uncontracted” Unified
English Braille. This word is encoded using seven symbols since the
first one is a modifier indicating that the first letter is capitalized.


Example 2.22 — Representing objects in C (optional). We can use
programming languages to probe how our computing environment
represents various values. This is easiest to do in “unsafe”
programming languages such as C that allow direct access to the
memory.

Using a simple C program we have produced the following
representations of various values. One can see that for integers,
multiplying by 2 corresponds to a “left shift” inside each byte. In
contrast, for floating-point numbers, multiplying by two
corresponds to adding one to the exponent part of the representation.
In the architecture we used, a negative number is represented
using the two’s complement approach. C represents strings in a
prefix-free form by ensuring that a zero byte is at their end.


    int    2     : 00000010 00000000 00000000 00000000
    int    4     : 00000100 00000000 00000000 00000000
    int    513   : 00000001 00000010 00000000 00000000
    long   513   : 00000001 00000010 00000000 00000000
                   00000000 00000000 00000000 00000000
    int    -1    : 11111111 11111111 11111111 11111111
    int    -2    : 11111110 11111111 11111111 11111111
    string Hello : 01001000 01100101 01101100 01101100
                   01101111 00000000
    string abcd  : 01100001 01100010 01100011 01100100
                   00000000
    float  33.0  : 00000000 00000000 00000100 01000010
    float  66.0  : 00000000 00000000 10000100 01000010
    float  132.0 : 00000000 00000000 00000100 01000011
    double 132.0 : 00000000 00000000 00000000 00000000
                   00000000 10000000 01100000 01000000
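Similar byte patterns can be reproduced without C by using Python's standard `struct` module. This is a sketch rather than the program used for the table above: the `"<"` prefix is an assumption that forces little-endian byte order with standard sizes (4-byte `int` and `float`), and `bits` is a helper invented here.

```python
import struct

def bits(raw):
    """Render bytes as space-separated groups of 8 bits, in memory order."""
    return " ".join(format(byte, "08b") for byte in raw)

int_2 = bits(struct.pack("<i", 2))
int_4 = bits(struct.pack("<i", 4))
int_neg1 = bits(struct.pack("<i", -1))
float_33 = bits(struct.pack("<f", 33.0))
float_66 = bits(struct.pack("<f", 66.0))

# A C-style string: the ASCII bytes followed by a terminating zero byte.
c_string = bits("abcd".encode("ascii") + b"\x00")
```

Comparing `float_33` with `float_66` shows the effect described above: doubling the value changes only the exponent bits.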


2.5.6 Representing vectors, matrices, images 

Once we can represent numbers and lists of numbers, then we can also 
represent vectors (which are just lists of numbers). Similarly, we can 
represent lists of lists, and thus, in particular, can represent matrices. 
To represent an image, we can represent the color at each pixel by a 
list of three numbers corresponding to the intensity of Red, Green and 
Blue. (We can restrict to three primary colors since most humans only 
have three types of cones in their retinas; we would have needed 16 
primary colors to represent colors visible to the Mantis Shrimp.) Thus 
an image of n pixels would be represented by a list of n such length- 
three lists. A video can be represented as a list of images. Of course 
these representations are rather wasteful and much more compact 
representations are typically used for images and videos, though this 
will not be our concern in this book. 


2.5.7 Representing graphs 
A graph on $n$ vertices can be represented as an $n \times n$ adjacency
matrix whose $(i,j)$ entry is equal to 1 if the edge $(i,j)$ is present
and is equal to 0 otherwise. That is, we can represent an $n$ vertex
directed graph $G = (V,E)$ as a string $A \in \{0,1\}^{n^2}$ such that
$A_{i,j} = 1$ iff the edge $\overrightarrow{i\,j} \in E$. We can
transform an undirected graph to a directed graph by replacing every
edge $\{i,j\}$ with both edges $\overrightarrow{i\,j}$ and
$\overleftarrow{i\,j}$.

Another representation for graphs is the adjacency list representation.
That is, we identify the vertex set $V$ of a graph with the set $[n]$
where $n = |V|$, and represent the graph $G = (V,E)$ as a list of $n$
lists, where the $i$-th list consists of the out-neighbors of vertex $i$.
The difference between these representations can be significant for some
applications, though for us would typically be immaterial.
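Both representations can be built in a few lines. The sketch below uses the five-vertex graph from the caption of Fig. 2.11; flattening the matrix row by row into a single string is one possible convention, assumed here for illustration.

```python
n = 5
edges = [(1, 0), (4, 0), (1, 4), (4, 1), (2, 1), (3, 2), (4, 3)]

# Adjacency matrix: matrix[i][j] == 1 iff the edge from i to j is present.
matrix = [[0] * n for _ in range(n)]
for i, j in edges:
    matrix[i][j] = 1

# The matrix as a single string of n*n bits, read row by row.
as_string = "".join(str(bit) for row in matrix for bit in row)

# Adjacency list: the v-th list holds the out-neighbors of vertex v.
adj_list = [sorted(j for (i, j) in edges if i == v) for v in range(n)]
```

The matrix always takes $n^2$ bits, while the adjacency list grows with the number of edges, which is where the difference between the two can matter.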


2.5.8 Representing lists and nested lists 

If we have a way of representing objects from a set $O$ as binary
strings, then we can represent lists of these objects by applying a
prefix-free transformation. Moreover, we can use a trick similar to the
above to handle nested lists. The idea is that if we have some
representation $E : O \rightarrow \{0,1\}^*$, then we can represent
nested lists of items from $O$ using strings over the five element
alphabet $\Sigma = \{$ 0, 1, [, ], , $\}$. For example, if $o_1$ is
represented by 0011, $o_2$ is represented by 10011, and $o_3$ is
represented by 00111, then we can represent the nested list
$(o_1, (o_2, o_3))$ as the string "[0011,[10011,00111]]" over the
alphabet $\Sigma$. By encoding every element of $\Sigma$ itself as a
three-bit string, we can transform any representation for objects $O$
into a representation that enables representing (potentially nested)
lists of these objects.
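The nested-list trick can be sketched as follows. The particular three-bit codes in `CODES` are a hypothetical choice (any one-to-one assignment would do), nested Python lists of bit strings stand in for nested lists of already-encoded objects, and the decoder assumes every leaf encoding is non-empty.

```python
# Hypothetical 3-bit codes for the five symbols of the alphabet.
CODES = {"0": "000", "1": "001", "[": "010", "]": "011", ",": "100"}
SYMBOLS = {code: sym for sym, code in CODES.items()}

def to_symbols(item):
    """Nested lists of bit strings -> string over the five-symbol alphabet."""
    if isinstance(item, str):
        return item
    return "[" + ",".join(to_symbols(sub) for sub in item) + "]"

def to_bits(item):
    """Encode every symbol by its 3-bit code."""
    return "".join(CODES[sym] for sym in to_symbols(item))

def from_bits(bits):
    """Invert to_bits (assumes non-empty leaf strings)."""
    symbols = "".join(SYMBOLS[bits[i:i + 3]] for i in range(0, len(bits), 3))
    pos = 0

    def parse():
        nonlocal pos
        if symbols[pos] != "[":             # a leaf: a maximal run of 0/1
            start = pos
            while pos < len(symbols) and symbols[pos] in "01":
                pos += 1
            return symbols[start:pos]
        pos += 1                            # skip "["
        items = []
        while symbols[pos] != "]":
            items.append(parse())
            if symbols[pos] == ",":
                pos += 1                    # skip the separator
        pos += 1                            # skip "]"
        return items

    return parse()

nested = ["0011", ["10011", "00111"]]
```

With these definitions `to_symbols(nested)` is the string "[0011,[10011,00111]]" from the text, and `from_bits(to_bits(nested))` recovers the nested list.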


2.5.9 Notation 
We will typically identify an object with its representation as a string.
For example, if $F : \{0,1\}^* \rightarrow \{0,1\}^*$ is some function
that maps strings to strings and $n$ is an integer, we might make
statements such as “$F(n) + 1$ is prime” to mean that if we represent $n$
as a string $x$, then the integer $m$ represented by the string $F(x)$
satisfies that $m + 1$ is prime. (You can see how this convention of
identifying objects with their representation can save us a lot of
cumbersome formalism.) Similarly, if $x, y$ are some objects and $F$ is a
function that takes strings as inputs, then by $F(x,y)$ we will mean the
result of applying $F$ to the representation of the ordered pair
$(x,y)$. We use the same notation to invoke functions on $k$-tuples of
objects for every $k$.

This convention of identifying an object with its representation as 
a string is one that we humans follow all the time. For example, when 
people say a statement such as “17 is a prime number”, what they 
really mean is that the integer whose decimal representation is the 
string “17”, is prime. 


When we say 


A is an algorithm that computes the multiplication function on natural num- 
bers. 


what we really mean is that 


A is an algorithm that computes the function
$F : \{0,1\}^* \rightarrow \{0,1\}^*$ such that for every pair
$a, b \in \mathbb{N}$, if $x \in \{0,1\}^*$ is a string representing the
pair $(a,b)$ then $F(x)$ will be a string representing their product
$a \cdot b$.
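To make the convention concrete, here is one way the multiplication task could look as a function from strings to strings. It is only a sketch: the pair $(a,b)$ is represented by concatenating the bit-doubling prefix-free encodings of the binary forms of $a$ and $b$, which is one possible choice among many, and the names `PF`, `encode_pair`, `decode_pair` and `F` are used here for illustration.

```python
def PF(x):
    """Double every bit of x and append the end marker 01."""
    return "".join(b + b for b in x) + "01"

def encode_pair(a, b):
    """Represent (a, b) by concatenating prefix-free encodings."""
    return PF(format(a, "b")) + PF(format(b, "b"))

def decode_pair(x):
    """Recover the numbers from a string produced by encode_pair."""
    numbers, current = [], ""
    for i in range(0, len(x), 2):
        pair = x[i:i + 2]
        if pair == "01":                # end marker: one number is complete
            numbers.append(int(current, 2))
            current = ""
        else:                           # "00" or "11": a doubled bit
            current += pair[0]
    return tuple(numbers)

def F(x):
    """The multiplication task as a function from strings to strings."""
    a, b = decode_pair(x)
    return format(a * b, "b")
```

Saying that an algorithm “computes multiplication” then means that it computes this `F` (or the analogous function for whichever representation is fixed): on the string representing $(6,7)$ it outputs the string 101010 representing 42.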


Figure 2.11: Representing the graph
$G = (\{0,1,2,3,4\}, \{(1,0),(4,0),(1,4),(4,1),(2,1),(3,2),(4,3)\})$
in the adjacency matrix and adjacency list representations.

2.6 DEFINING COMPUTATIONAL TASKS AS MATHEMATICAL FUNCTIONS


Abstractly, a computational process is some process that takes an input 
which is a string of bits and produces an output which is a string 

of bits. This transformation of input to output can be done using a 
modern computer, a person following instructions, the evolution of 
some natural system, or any other means. 

In future chapters, we will turn to mathematically defining com- 
putational processes, but, as we discussed above, at the moment we 
focus on computational tasks. That is, we focus on the specification and 
not the implementation. Again, at an abstract level, a computational 
task can specify any relation that the output needs to have with the in- 
put. However, for most of this book, we will focus on the simplest and 
most common task of computing a function. Here are some examples: 


• Given (a representation of) two integers $x, y$, compute the product
$x \times y$. Using our representation above, this corresponds to
computing a function from $\{0,1\}^*$ to $\{0,1\}^*$. We have seen that
there is more than one way to solve this computational task, and in fact,
we still do not know the best algorithm for this problem.


• Given (a representation of) an integer $z > 1$, compute its
factorization; i.e., the list of primes $p_1 \leq \cdots \leq p_k$ such
that $z = p_1 \cdots p_k$. This again corresponds to computing a function
from $\{0,1\}^*$ to $\{0,1\}^*$. The gaps in our knowledge of the
complexity of this problem are even larger.


• Given (a representation of) a graph $G$ and two vertices $s$ and $t$,
compute the length of the shortest path in $G$ between $s$ and $t$, or do
the same for the longest path (with no repeated vertices) between
$s$ and $t$. Both these tasks correspond to computing a function from
$\{0,1\}^*$ to $\{0,1\}^*$, though it turns out that there is a vast
difference in their computational difficulty.


• Given the code of a Python program, determine whether there is an
input that would force it into an infinite loop. This task corresponds
to computing a partial function from $\{0,1\}^*$ to $\{0,1\}$ since not
every string corresponds to a syntactically valid Python program. We will
see that we do understand the computational status of this problem,
but the answer is quite surprising.


• Given (a representation of) an image $I$, decide if $I$ is a photo of a
cat or a dog. This corresponds to computing some (partial) function from
$\{0,1\}^*$ to $\{0,1\}$.


For every particular function F, there can be several possible algo- 
rithms to compute F. We will be interested in questions such as: 


• For a given function F, can it be the case that there is no algorithm to
compute F?

• If there is an algorithm, what is the best one? Could it be that F is
“effectively uncomputable” in the sense that every algorithm for
computing F requires a prohibitively large amount of resources?

• If we cannot answer this question, can we show equivalence between
different functions F and F′ in the sense that either they are
both easy (i.e., have fast algorithms) or they are both hard?

• Can a function being hard to compute ever be a good thing? Can we
use it for applications in areas such as cryptography?


In order to do that, we will need to mathematically define the no- 
tion of an algorithm, which is what we will do in Chapter 3. 


2.6.1 Distinguish functions from programs! 

You should always watch out for potential confusions between specifications
and implementations or equivalently between mathematical
functions and algorithms/programs. It does not help that programming
languages (Python included) use the term “functions” to denote
(parts of) programs. This confusion also stems from thousands of years
of mathematical history, where people typically defined functions by
means of a way to compute them.

Figure 2.12: A subset L ⊆ {0,1}* can be identified
with the function F : {0,1}* → {0,1} such that
F(x) = 1 if x ∈ L and F(x) = 0 if x ∉ L. Functions
with a single bit of output are called Boolean functions,
while subsets of strings are called languages. The
above shows that the two are essentially the same
object, and we can identify the task of deciding
membership in L (known as deciding a language in the
literature) with the task of computing the function F.

For example, consider the multiplication function on natural numbers.
This is the function MULT : ℕ × ℕ → ℕ that maps a pair (x, y)
of natural numbers to the number x · y. As we mentioned, it can be
implemented in more than one way:

def mult1(x,y):
    res = 0
    while y>0:
        res += x
        y -= 1
    return res

def mult2(x,y):
    a = str(x) # represent x as string in decimal notation
    b = str(y) # represent y as string in decimal notation
    res = 0
    for i in range(len(a)):
        for j in range(len(b)):
            # a[len(a)-1-i] is the digit of x with place value 10**i
            res += int(a[len(a)-1-i])*int(b[len(b)-1-j])*(10**(i+j))
    return res

print(mult1(12,7))
# 84
print(mult2(12,7))
# 84


Both mult1 and mult2 produce the same output given the same
pair of natural number inputs. (Though mult1 will take far longer to
do so when the numbers become large.) Hence, even though these are
two different programs, they compute the same mathematical function.
This distinction between a program or algorithm A, and the function F
that A computes will be absolutely crucial for us in this course (see also
Fig. 2.13).
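The difference in cost between two programs that compute the same function can be made visible by counting steps. The sketch below is an illustration under assumptions: mult1_steps and mult2_steps are hypothetical instrumented variants of the two approaches, and "steps" simply counts loop iterations.

```python
def mult1_steps(x, y):
    """Repeated addition; returns (product, number of loop iterations)."""
    res, steps = 0, 0
    while y > 0:
        res += x
        y -= 1
        steps += 1
    return res, steps

def mult2_steps(x, y):
    """Digit-by-digit multiplication; returns (product, number of digit products)."""
    a, b = str(x), str(y)
    res, steps = 0, 0
    for i in range(len(a)):
        for j in range(len(b)):
            res += int(a[len(a) - 1 - i]) * int(b[len(b) - 1 - j]) * (10 ** (i + j))
            steps += 1
    return res, steps

x, y = 123456, 654321
p1, s1 = mult1_steps(x, y)
p2, s2 = mult2_steps(x, y)
print("same output:", p1 == p2 == x * y)
print("iterations of mult1 vs mult2:", s1, s2)
```

The number of iterations of the first program grows with the value of y, while for the second it grows with the product of the numbers of digits of x and y.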


Distinguishing functions from programs (or other ways for comput- 
ing, including circuits and machines) is a crucial theme for this course. 
For this reason, this is often a running theme in questions that I (and 
many other instructors) assign in homework and exams (hint, hint). 


[Figure 2.13 panels: “This is NOT a function” (the program text of even(x)) and “This IS a function” (the table of input and output values of even(x)).]


Figure 2.13: A function is a mapping of inputs to 
outputs. A program is a set of instructions on how 
to obtain an output given an input. A program 
computes a function, but it is not the same as a func- 
tion, popular programming language terminology 
notwithstanding. 
2.7 EXERCISES


Exercise 2.1 Which one of these objects can be represented by a binary 
string? 


a. An integer x 
b. An undirected graph G. 
c. A directed graph H 


d. All of the above. 


Exercise 2.2 — Binary representation. a. Prove that the function NtS : ℕ →
{0,1}* of the binary representation defined in (2.1) satisfies that for
every n ∈ ℕ, if x = NtS(n) then |x| = 1 + max(0, ⌊log_2 n⌋) and
x_i = ⌊n / 2^{⌊log_2 n⌋ − i}⌋ mod 2.

b. Prove that NtS is a one to one function by coming up with a function
StN : {0,1}* → ℕ such that StN(NtS(n)) = n for every
n ∈ ℕ.


Exercise 2.3 — More compact than ASCII representation. The ASCII encoding 
can be used to encode a string of n English letters as a 7n bit binary 
string, but in this exercise, we ask about finding a more compact rep- 
resentation for strings of English lowercase letters. 


1. Prove that there exists a representation scheme (E, D) for strings
over the 26-letter alphabet {a, b, c, ..., z} as binary strings such
that for every n > 0 and length-n string x ∈ {a, b, ..., z}^n, the
representation E(x) is a binary string of length at most 4.8n + 1000.
In other words, prove that for every n, there exists a one-to-one
function E : {a, b, ..., z}^n → {0,1}^{⌊4.8n+1000⌋}.

2. Prove that there exists no representation scheme for strings over the
alphabet {a, b, ..., z} as binary strings such that for every length-n
string x ∈ {a, b, ..., z}^n, the representation E(x) is a binary string of
length ⌊4.6n + 1000⌋. In other words, prove that there exists some
n > 0 such that there is no one-to-one function
E : {a, b, ..., z}^n → {0,1}^{⌊4.6n+1000⌋}.


3. Python’s bz2.compress function is a mapping from strings to
strings, which uses the lossless (and hence one to one) bzip2 algorithm
for compression. After converting to lowercase, and truncating
spaces and numbers, the text of Tolstoy’s “War and Peace” contains
n = 2,517,262 characters. Yet, if we run bz2.compress on the string of
the text of “War and Peace” we get a string of length k = 6,274,768
bits, which is only 2.49n (and in particular much smaller than
4.6n). Explain why this does not contradict your answer to the
previous question.


4. Interestingly, if we try to apply bz2.compress on a random string,
we get much worse performance. In my experiments, I got a ratio
of about 4.78 between the number of bits in the output and the
number of characters in the input. However, one could imagine that
one could do better and that there exists a company called “Pied
Piper”⁵ with an algorithm that can losslessly compress a string of n
random lowercase letters to fewer than 4.6n bits. Show that this
is not the case by proving that for every n > 100 and one to one
function Encode : {a, ..., z}^n → {0,1}*, if we let Z be the random
variable |Encode(x)| (i.e., the length of Encode(x)) for x chosen
uniformly at random from the set {a, ..., z}^n, then the expected
value of Z is at least 4.6n.
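The experiment described in part 4 can be repeated with a short script. This is only a sketch of one way to run it: the length n = 100,000, the fixed seed, and the variable names are assumptions, and the exact ratio obtained will depend on them.

```python
import bz2
import math
import random
import string

random.seed(0)  # fixed seed so that the experiment is repeatable
n = 100_000
text = "".join(random.choice(string.ascii_lowercase) for _ in range(n))

# bz2.compress works on bytes, so encode the lowercase letters as ASCII first.
compressed = bz2.compress(text.encode("ascii"))
bits_per_char = 8 * len(compressed) / n

# Lossless means we can always recover the original text.
recovered = bz2.decompress(compressed).decode("ascii")

print(f"bits per character: {bits_per_char:.2f}")
print(f"log2(26)          : {math.log2(26):.2f}")
print("round trip ok     :", recovered == text)
```

Comparing the measured ratio with log2(26) is a useful sanity check on the bound that the exercise asks you to prove, although of course it is not a proof.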


Exercise 2.4 — Representing graphs: upper bound. Show that there is a string
representation of directed graphs with vertex set [n] and degree at
most 10 that uses at most 1000 n log n bits. More formally, show the
following: Suppose we define for every n ∈ ℕ, the set G_n as the set
containing all directed graphs (with no self loops) over the vertex
set [n] where every vertex has degree at most 10. Then, prove that for
every sufficiently large n, there exists a one-to-one function
E : G_n → {0,1}^{⌊1000 n log n⌋}.


Exercise 2.5 — Representing graphs: lower bound. 1. Define S_n to be the
set of one-to-one and onto functions mapping [n] to [n]. Prove that
there is a one-to-one mapping from S_n to G_{2n}, where G_{2n} is the set
defined in Exercise 2.4 above.

2. In this question you will show that one cannot improve the representation
of Exercise 2.4 to length o(n log n). Specifically, prove
for every sufficiently large n ∈ ℕ there is no one-to-one function
E : G_n → {0,1}^{⌊0.001 n log n⌋ + 1000}.


Exercise 2.6 — Multiplying in different representation. Recall that the grade-school
algorithm for multiplying two numbers requires O(n^2) operations.
Suppose that instead of using decimal representation, we use
one of the following representations R(x) to represent a number x
between 0 and 10^n − 1. For which of these representations can you
still multiply the numbers in O(n^2) operations?


5 Actually that particular fictional company uses a
metric that focuses more on compression speed than
ratio.
a. The standard binary representation: B(x) = (x_0, ..., x_k) where
x = Σ_{i=0}^{k} x_i 2^i and k is the largest number s.t. x ≥ 2^k.

b. The reverse binary representation: B(x) = (x_k, ..., x_0) where x_i is
defined as above for i = 0, ..., k − 1.

c. Binary coded decimal representation: B(x) = (y_0, ..., y_{n−1}) where
y_i ∈ {0,1}^4 represents the i-th decimal digit of x, mapping 0 to 0000,
1 to 0001, 2 to 0010, etc. (i.e., 9 maps to 1001).

d. All of the above.


Exercise 2.7 Suppose that R : ℕ → {0,1}* corresponds to representing a
number x as a string of x 1’s (e.g., R(4) = 1111, R(7) = 1111111, etc.).
If x, y are numbers between 0 and 10^n − 1, can we still multiply x and
y using O(n^2) operations if we are given them in the representation
R(·)?

Exercise 2.8 Recall that if F is a one-to-one and onto function mapping
elements of a finite set U into a finite set V then the sizes of U and V
are the same. Let B : ℕ → {0,1}* be the function such that for every
x ∈ ℕ, B(x) is the binary representation of x.

1. Prove that x < 2^k if and only if |B(x)| ≤ k.

2. Use 1. to compute the size of the set {y ∈ {0,1}* : |y| ≤ k} where |y|
denotes the length of the string y.

3. Use 1. and 2. to prove that 2^k − 1 = 1 + 2 + 4 + ⋯ + 2^{k−1}.


Exercise 2.9 — Prefix-free encoding of tuples. Suppose that F : ℕ → {0,1}*
is a one-to-one function that is prefix-free in the sense that there is no
a ≠ b s.t. F(a) is a prefix of F(b).

a. Prove that F_2 : ℕ × ℕ → {0,1}*, defined as F_2(a, b) = F(a)F(b) (i.e.,
the concatenation of F(a) and F(b)) is a one-to-one function.

b. Prove that F_* : ℕ* → {0,1}* defined as F_*(a_1, ..., a_k) =
F(a_1) ⋯ F(a_k) is a one-to-one function, where ℕ* denotes the set of
all finite-length lists of natural numbers.


Exercise 2.10 — More efficient prefix-free transformation. Suppose that
F : O → {0,1}* is some (not necessarily prefix-free) representation of the
objects in the set O, and G : ℕ → {0,1}* is a prefix-free representation
of the natural numbers. Define F′(o) = G(|F(o)|)F(o) (i.e., the
concatenation of the representation of the length of F(o) and F(o) itself).

a. Prove that F′ is a prefix-free representation of O.

b. Show that we can transform any representation to a prefix-free one
by a modification that takes a k bit string into a string of length at
most k + O(log k).

c. Show that we can transform any representation to a prefix-free one
by a modification that takes a k bit string into a string of length at
most k + log k + O(log log k).⁶


Exercise 2.11 — Kraft’s Inequality. Suppose that S ⊆ {0,1}* is some finite
prefix-free set, and let n be some number larger than max{|x| : x ∈ S}.

a. For every x ∈ S, let L(x) ⊆ {0,1}^n denote all the length-n strings
whose first k bits are x_0, ..., x_{k−1}, where k = |x|. Prove that
(1) |L(x)| = 2^{n−|x|} and (2) for every distinct x, x′ ∈ S, L(x) is
disjoint from L(x′).

b. Prove that Σ_{x∈S} 2^{−|x|} ≤ 1. (Hint: first show that Σ_{x∈S} |L(x)| ≤ 2^n.)

c. Prove that there is no prefix-free encoding of strings with less than
logarithmic overhead. That is, prove that there is no function
PF : {0,1}* → {0,1}* s.t. |PF(x)| ≤ |x| + 0.9 log |x| for every sufficiently
large x ∈ {0,1}* and such that the set {PF(x) : x ∈ {0,1}*} is prefix-free.
The factor 0.9 is arbitrary; all that matters is that it is less than
1.
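Before attempting the proof, it can help to check the quantities in Exercise 2.11 on a small instance. The set S and the value n below are hypothetical choices made for illustration, as are the helper names is_prefix_free and extensions (the latter plays the role of L(x)).

```python
from fractions import Fraction
from itertools import product

def is_prefix_free(S):
    """True if no string in S is a prefix of a different string in S."""
    return not any(x != y and y.startswith(x) for x in S for y in S)

def extensions(x, n):
    """L(x): the set of all length-n strings whose first |x| bits are x."""
    return {x + "".join(tail) for tail in product("01", repeat=n - len(x))}

S = ["0", "10", "110", "1110"]  # a hypothetical finite prefix-free set
n = 5                           # any number larger than the longest string in S

# Exact rational arithmetic avoids any floating-point doubt about the sum.
kraft_sum = sum(Fraction(1, 2 ** len(x)) for x in S)

print("prefix-free:", is_prefix_free(S))
print("sizes of L(x):", [len(extensions(x, n)) for x in S])
print("sum of 2^(-|x|):", kraft_sum)
```

A numerical check on one example is of course not a proof, but it shows concretely what parts (a) and (b) assert.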


Exercise 2.12 — Composition of one-to-one functions. Prove that for every
two one-to-one functions F : S → T and G : T → U, the function
H : S → U defined as H(x) = G(F(x)) is one to one.


Exercise 2.13 — Natural numbers and strings. 1. We have shown that
the natural numbers can be represented as strings. Prove that
the other direction holds as well: that there is a one-to-one map
StN : {0,1}* → ℕ. (StN stands for “strings to numbers.”)

2. Recall that Cantor proved that there is no one-to-one map
RtN : ℝ → ℕ. Show that Cantor’s result implies Theorem 2.5.


Exercise 2.14 — Map lists of integers to a number. Recall that for every set
S, the set S* is defined as the set of all finite sequences of members
of S (i.e., S* = {(x_0, ..., x_{n−1}) | n ∈ ℕ, ∀ i ∈ [n] : x_i ∈ S}).
Prove that there is a one-to-one map from ℤ* to ℕ where
ℤ = {..., −3, −2, −1, 0, +1, +2, +3, ...} is the set of all integers.


6 Hint: Think recursively how to represent the length 
of the string. 
2.8 BIBLIOGRAPHICAL NOTES


The study of representing data as strings, including issues such as
compression and error correction, falls under the purview of information
theory, as covered in the classic textbook of Cover and Thomas [CT06].
Representations are also studied in the field of data structures design, as
covered in texts such as [Cor+09].

The question of whether to represent integers with the most signif- 
icant digit first or last is known as Big Endian vs. Little Endian repre- 
sentation. This terminology comes from Cohen's [Coh81] entertaining 
and informative paper about the conflict between adherents of both 
schools which he compared to the warring tribes in Jonathan Swift’s 
“Gulliver's Travels”. The two’s complement representation of signed 
integers was suggested in von Neumann’s classic report [Neu45] 
that detailed the design approaches for a stored-program computer, 
though similar representations have been used even earlier in abacus 
and other mechanical computation devices. 

The idea that we should separate the definition or specification of 
a function from its implementation or computation might seem “obvi- 
ous,” but it took quite a lot of time for mathematicians to arrive at this 
viewpoint. Historically, a function F was identified by rules or formu- 
las showing how to derive the output from the input. As we discuss 
in greater depth in Chapter 9, in the 1800s this somewhat informal 
notion of a function started “breaking at the seams,” and eventually 
mathematicians arrived at the more rigorous definition of a function 
as an arbitrary assignment of inputs to outputs. While many functions
may be described (or computed) by one or more formulas, today we 
do not consider that to be an essential property of functions, and also 
allow functions that do not correspond to any “nice” formula. 

We have mentioned that all representations of the real numbers 
are inherently approximate. Thus an important endeavor is to under- 
stand what guarantees we can offer on the approximation quality of 
the output of an algorithm, as a function of the approximation quality 
of the inputs. This question is known as the question of determining 
the numerical stability of given equations. The Floating-Point Guide
website contains an extensive description of the floating-point representation,
as well as the many ways in which it could subtly fail; see also
the website 0.30000000000000004.com.
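The name of that website refers to a standard illustration of the approximate nature of floating-point numbers, which can be reproduced in Python. The snippet is a small sketch; math.isclose is one common way to compare floating-point values up to a tolerance.

```python
import math

total = 0.1 + 0.2

print(total)                     # 0.30000000000000004
print(total == 0.3)              # False: neither side is represented exactly
print(math.isclose(total, 0.3))  # True: equal up to a small relative tolerance
```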

Dauben [Dau90] gives a biography of Cantor with emphasis on 
the development of his mathematical ideas. [Hal60] is a classic text- 
book on set theory, also including Cantor’s theorem. Cantor’s Theo- 
rem is also covered in many texts on discrete mathematics, including 
[LLM18; LZ19]. 
The adjacency matrix representation of graphs is not merely a convenient
way to map a graph into a binary string, but it turns out that
many natural notions and operations on matrices are useful for graphs 
as well. (For example, Google’s PageRank algorithm relies on this 
viewpoint.) The notes of Spielman’s course are an excellent source for 
this area, known as spectral graph theory. We will return to this view 
much later in this book when we talk about random walks. 


FINITE COMPUTATION


Learning Objectives:

• See that computation can be precisely modeled.
• Learn the computational model of Boolean circuits / straight-line programs.
• Equivalence of circuits and straight-line programs.
• Equivalence of AND/OR/NOT and NAND.
• Examples of computing in the physical world.


3 
Defining computation 


“there is no reason why mental as well as bodily labor should not be economized 
by the aid of machinery”, Charles Babbage, 1852 


“If, unwarned by my example, any man shall undertake and shall succeed
in constructing an engine embodying in itself the whole of the executive
department of mathematical analysis upon different principles or by simpler
mechanical means, I have no fear of leaving my reputation in his charge, for he
alone will be fully able to appreciate the nature of my efforts and the value of
their results.”, Charles Babbage, 1864


“To understand a program you must become both the machine and the pro- 
gram.”, Alan Perlis, 1982 


People have been computing for thousands of years, with aids
that include not just pen and paper, but also abacus, slide rules, various
mechanical devices, and modern electronic computers. A priori,
the notion of computation seems to be tied to the particular mechanism
that you use. You might think that the “best” algorithm for
multiplying numbers will differ if you implement it in Python on a
modern laptop than if you use pen and paper. However, as we saw
in the introduction (Chapter 0), an algorithm that is asymptotically
better would eventually beat a worse one regardless of the underlying
technology. This gives us hope for a technology independent way
of defining computation. This is what we do in this chapter. We will
define the notion of computing an output from an input by applying a
sequence of basic operations (see Fig. 3.3). Using this, we will be able
to precisely define statements such as “function f can be computed
by model X” or “function f can be computed by model X using s
operations”.

Figure 3.1: Calculating wheels by Charles Babbage.
Image taken from the Mark I ‘operating manual’.

Figure 3.2: A 1944 Popular Mechanics article on the
Harvard Mark I computer (headline: “Robot Works Problems Never Before Solved”).

Compiled on 12.19.2022 22:58 
[Figure 3.3 panels: “What” (finite functions, shown as a table of inputs and outputs) and “How” (computational models: circuits, straight-line programs).]


Figure 3.3: A function mapping strings to strings 
specifies a computational task, i.e., describes what the 
desired relation between the input and the output 
is. In this chapter we define models for implementing 
computational processes that achieve the desired 
relation, i.e., describe how to compute the output 
from the input. We will see several examples of such 
models using both Boolean circuits and straight-line 
programming languages. 


3.1 DEFINING COMPUTATION 


The name “algorithm” is derived from the Latin transliteration of 
Muhammad ibn Musa al-Khwarizmi’s name. Al-Khwarizmi was a 
Persian scholar during the 9th century whose books introduced the 
western world to the decimal positional numeral system, as well as to 
the solutions of linear and quadratic equations (see Fig. 3.4). However,
Al-Khwarizmi’s descriptions of algorithms were rather informal by
today’s standards. Rather than use “variables” such as x, y, he used 
concrete numbers such as 10 and 39, and trusted the reader to be 
able to extrapolate from these examples, much as algorithms are still 
taught to children today. 

Here is how Al-Khwarizmi described the algorithm for solving an
equation of the form x^2 + bx = c:


[How to solve an equation of the form] “roots and squares are equal to numbers”:
For instance “one square, and ten roots of the same, amount to thirty-nine
dirhems” that is to say, what must be the square which, when increased
by ten of its own root, amounts to thirty-nine? The solution is this: you halve
the number of the roots, which in the present instance yields five. This you
multiply by itself; the product is twenty-five. Add this to thirty-nine; the sum
is sixty-four. Now take the root of this, which is eight, and subtract from it half
the number of roots, which is five; the remainder is three. This is the root of the
square which you sought for; the square itself is nine.


Figure 3.4: Text pages from Algebra manuscript with
geometrical solutions to two quadratic equations.
Shelfmark: MS. Huntington 214 fol. 004v-005r

Figure 3.5: An explanation for children of the two digit
addition algorithm

For the purposes of this book, we will need a much more precise
way to describe algorithms. Fortunately (or is it unfortunately?), at
least at the moment, computers lag far behind school-age children
in learning from examples. Hence in the 20th century, people came
up with exact formalisms for describing algorithms, namely programming
languages. Here is al-Khwarizmi’s quadratic equation solving
algorithm described in the Python programming language:

from math import sqrt
# Pythonspeak to enable use of the sqrt function to compute square roots.

def solve_eq(b,c):
    # return solution of x^2 + bx = c following Al Khwarizmi's instructions
    # Al Khwarizmi demonstrates this for the case b=10 and c=39

    val1 = b / 2.0       # "halve the number of the roots"
    val2 = val1 * val1   # "this you multiply by itself"
    val3 = val2 + c      # "Add this to thirty-nine"
    val4 = sqrt(val3)    # "take the root of this"
    val5 = val4 - val1   # "subtract from it half the number of roots"
    return val5          # "This is the root of the square which you sought for"

# Test: solve x^2 + 10*x = 39
print(solve_eq(10,39))
# 3.0
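One advantage of a precise description is that it can be tested on inputs other than the single worked example. The sketch below restates the same steps compactly and substitutes the result back into x^2 + bx = c; the extra test pairs are arbitrary choices for illustration, and the procedure assumes that b^2/4 + c is nonnegative so that the square root is defined.

```python
from math import sqrt, isclose

def solve_eq(b, c):
    """Positive root of x^2 + b*x = c, following the steps quoted above."""
    half_b = b / 2.0                        # "halve the number of the roots"
    return sqrt(half_b * half_b + c) - half_b

# Check the returned value by substituting it back into the equation.
for b, c in [(10, 39), (2, 3), (0, 16), (6, 7)]:
    x = solve_eq(b, c)
    print(f"b={b}, c={c}: x={x}, x^2 + b*x = {x * x + b * x}")
    assert isclose(x * x + b * x, c)
```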


We can define algorithms informally as follows: 


Informal definition of an algorithm: An algorithm is a set of instruc- 
tions for how to compute an output from an input by following a se- 
quence of “elementary steps”. 


An algorithm A computes a function F if for every input x, if we follow
the instructions of A on the input x, we obtain the output F(x).


In this chapter we will make this informal definition precise using 
the model of Boolean Circuits. We will show that Boolean Circuits 
are equivalent in power to straight line programs that are written in 
“ultra simple” programming languages that do not even have loops. 
We will also see that the particular choice of elementary operations is 
immaterial and many different choices yield models with equivalent 
power (see Fig. 3.6). However, it will take us some time to get there. 
We will start by discussing what are “elementary operations” and how 
we map a description of an algorithm into an actual physical process 
that produces an output from an input in the real world. 


[Figure 3.6 diagram: Boolean circuits and AON-CIRC straight-line programs (using AND, OR, NOT), and NAND circuits and NAND-CIRC straight-line programs (using NAND), with arrows indicating that the models are equivalent.]


Figure 3.6: An overview of the computational models 
defined in this chapter. We will show several equiv- 
alent ways to represent a recipe for performing a 
finite computation. Specifically we will show that we 
can model such a computation using either a Boolean 
circuit or a straight line program, and these two repre- 
sentations are equivalent to one another. We will also 
show that we can choose as our basic operations ei- 
ther the set {AND, OR, NOT} or the set {NAND} and 
these two choices are equivalent in power. By making 
the choice of whether to use circuits or programs, 
and whether to use {AND, OR, NOT} or {NAND} we 
obtain four equivalent ways of modeling finite com- 
putation. Moreover, there are many other choices of 
sets of basic operations that are equivalent in power. 


3.2 COMPUTING USING AND, OR, AND NOT. 


An algorithm breaks down a complex calculation into a series of sim- 
pler steps. These steps can be executed in a variety of different ways, 
including: 


• Writing down symbols on a piece of paper.
• Modifying the current flowing on electrical wires.
• Binding a protein to a strand of DNA.
• Responding to a stimulus by a member of a collection (e.g., a bee in
a colony, a trader in a market).


To formally define algorithms, let us try to “err on the side of sim- 
plicity” and model our “basic steps” as truly minimal. For example, 
here are some very simple functions: 


• OR : {0,1}^2 → {0,1} defined as

OR(a, b) = 0 if a = b = 0, and 1 otherwise.

• AND : {0,1}^2 → {0,1} defined as

AND(a, b) = 1 if a = b = 1, and 0 otherwise.

• NOT : {0,1} → {0,1} defined as

NOT(a) = 0 if a = 1, and 1 if a = 0.


The functions AND, OR and NOT are the basic logical operators
used in logic and many computer systems. In the context of logic, it is
common to use the notation a ∧ b for AND(a, b), a ∨ b for OR(a, b) and
ā and ¬a for NOT(a), and we will use this notation as well.

Each one of the functions AND, OR, NOT takes either one or two 
single bits as input, and produces a single bit as output. Clearly, it 
cannot get much more basic than that. However, the power of compu- 
tation comes from composing such simple building blocks together. 
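The three definitions above translate directly into code. The following sketch writes each function by cases, mirroring the definitions, and prints the full truth tables; it is an illustration only.

```python
def OR(a, b):
    """0 if a = b = 0, and 1 otherwise."""
    return 0 if a == 0 and b == 0 else 1

def AND(a, b):
    """1 if a = b = 1, and 0 otherwise."""
    return 1 if a == 1 and b == 1 else 0

def NOT(a):
    """0 if a = 1, and 1 if a = 0."""
    return 0 if a == 1 else 1

# Print the truth tables of the three basic operations.
for a in [0, 1]:
    for b in [0, 1]:
        print(f"a={a} b={b}  AND={AND(a, b)}  OR={OR(a, b)}")
for a in [0, 1]:
    print(f"a={a}  NOT={NOT(a)}")
```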


Example 3.1 — Majority from AND, OR and NOT. Consider the function
MAJ : {0,1}^3 → {0,1} that is defined as follows:

MAJ(x) = 1 if x_0 + x_1 + x_2 ≥ 2, and 0 otherwise.

That is, for every x ∈ {0,1}^3, MAJ(x) = 1 if and only if the majority
(i.e., at least two out of the three) of x's elements are equal
to 1. Can you come up with a formula involving AND, OR and
NOT to compute MAJ? (It would be useful for you to pause at this
point and work out the formula for yourself. As a hint, although
the NOT operator is needed to compute some functions, you will
not need to use it to compute MAJ.)

Let us first try to rephrase MAJ(x) in words: “MAJ(x) = 1 if and
only if there exists some pair of distinct elements i, j such that both
x_i and x_j are equal to 1.” In other words it means that MAJ(x) = 1
iff either both x_0 = 1 and x_1 = 1, or both x_1 = 1 and x_2 = 1, or both
x_0 = 1 and x_2 = 1. Since the OR of three conditions c_0, c_1, c_2 can
be written as OR(c_0, OR(c_1, c_2)), we can now translate this into a
formula as follows:

MAJ(x_0, x_1, x_2) = OR( AND(x_0, x_1), OR( AND(x_1, x_2), AND(x_0, x_2) ) ).    (3.1)

Recall that we can also write a ∨ b for OR(a, b) and a ∧ b for
AND(a, b). With this notation, (3.1) can also be written as

MAJ(x_0, x_1, x_2) = ((x_0 ∧ x_1) ∨ (x_1 ∧ x_2)) ∨ (x_0 ∧ x_2).


We can also write (3.1) in a “programming language” form, 
expressing it as a set of instructions for computing MAJ given the 
basic operations AND, OR, NOT: 


def MAJ(X[0],X[1],X[2]):
    firstpair = AND(X[0],X[1])
    secondpair = AND(X[1],X[2])
    thirdpair = AND(X[0],X[2])
    temp = OR(secondpair,thirdpair)
    return OR(firstpair,temp)
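The listing above is written in a schematic form. A runnable variant, given here as a sketch, takes the three bits as a single sequence X (an assumption made so that the code is valid Python) and compares the result with the definition of MAJ on all eight inputs.

```python
from itertools import product

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)

def MAJ(X):
    """Majority of the three bits X[0], X[1], X[2], following formula (3.1)."""
    firstpair = AND(X[0], X[1])
    secondpair = AND(X[1], X[2])
    thirdpair = AND(X[0], X[2])
    temp = OR(secondpair, thirdpair)
    return OR(firstpair, temp)

# Compare with the definition: MAJ(x) = 1 iff at least two of the bits are 1.
for X in product([0, 1], repeat=3):
    expected = 1 if sum(X) >= 2 else 0
    assert MAJ(X) == expected
    print(X, MAJ(X))
```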


3.2.1 Some properties of AND and OR 

Like standard addition and multiplication, the functions AND and OR
satisfy the properties of commutativity: a ∨ b = b ∨ a and a ∧ b = b ∧ a,
and associativity: (a ∨ b) ∨ c = a ∨ (b ∨ c) and (a ∧ b) ∧ c = a ∧ (b ∧ c). As in
the case of addition and multiplication, we often drop the parentheses
and write a ∨ b ∨ c ∨ d for ((a ∨ b) ∨ c) ∨ d, and similarly for OR’s and AND’s
of more terms. They also satisfy a variant of the distributive law:


Solved Exercise 3.1 — Distributive law for AND and OR. Prove that for every
a, b, c ∈ {0,1}, a ∧ (b ∨ c) = (a ∧ b) ∨ (a ∧ c).


Solution: 

We can prove this by enumerating over all the 8 possible values
for a, b, c ∈ {0,1} but it also follows from the standard distributive
law. Suppose that we identify any positive integer with “true” and
the value zero with “false”. Then for every two numbers u, v ∈ ℕ, u + v
is positive if and only if u ∨ v is true and u · v is positive if and only
if u ∧ v is true. This means that for every a, b, c ∈ {0,1}, the expression
a ∧ (b ∨ c) is true if and only if a · (b + c) is positive, and the
expression (a ∧ b) ∨ (a ∧ c) is true if and only if a · b + a · c is positive.
But by the standard distributive law a · (b + c) = a · b + a · c and
hence the former expression is true if and only if the latter one is.
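The first proof strategy mentioned in the solution, enumerating all 8 cases, is easy to carry out mechanically. The sketch below also checks the arithmetic correspondence used in the second argument (positive means "true"); the helper definitions of AND and OR are repeated so that the snippet is self-contained.

```python
from itertools import product

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)

for a, b, c in product([0, 1], repeat=3):
    lhs = AND(a, OR(b, c))          # a AND (b OR c)
    rhs = OR(AND(a, b), AND(a, c))  # (a AND b) OR (a AND c)
    assert lhs == rhs
    # Arithmetic view: an expression is true iff the matching number is positive.
    assert (a * (b + c) > 0) == (lhs == 1)
    assert (a * b + a * c > 0) == (rhs == 1)

print("distributive law holds on all 8 inputs")
```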


3.2.2 Extended example: Computing XOR from AND, OR, and NOT 
Let us see how we can obtain a different function from the same
building blocks. Define XOR : {0,1}^2 → {0,1} to be the function
XOR(a, b) = a + b mod 2. That is, XOR(0,0) = XOR(1,1) = 0 and
XOR(1,0) = XOR(0,1) = 1. We claim that we can construct XOR
using only AND, OR, and NOT.


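The following algorithm computes XOR using AND, OR, and NOT:

Algorithm 3.2 — XOR from AND/OR/NOT.
Input: a, b ∈ {0,1}.
Output: XOR(a, b)
1: w1 ← AND(a, b)
2: w2 ← NOT(w1)
3: w3 ← OR(a, b)
4: return AND(w2, w3)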


Lemma 3.3 For every a, b ∈ {0,1}, on input a, b, Algorithm 3.2 outputs
a + b mod 2.

Proof. For every a, b, XOR(a, b) = 1 if and only if a is different from
b. On input a, b ∈ {0,1}, Algorithm 3.2 outputs AND(w2, w3) where
w2 = NOT(AND(a, b)) and w3 = OR(a, b).

• If a = b = 0 then w3 = OR(a, b) = 0 and so the output will be 0.

• If a = b = 1 then AND(a, b) = 1 and so w2 = NOT(AND(a, b)) = 0
and the output will be 0.

• If a = 1 and b = 0 (or vice versa) then both w3 = OR(a, b) = 1
and w1 = AND(a, b) = 0, in which case the algorithm will output
AND(NOT(w1), w3) = 1.


We can also express Algorithm 3.2 using a programming language. 
Specifically, the following is a Python program that computes the XOR 
function: 


def AND(a,b): return a*b
def OR(a,b): return 1-(1-a)*(1-b)
def NOT(a): return 1-a

def XOR(a,b):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    return AND(w2,w3)

# Test out the code
print([f"XOR({a},{b})={XOR(a,b)}" for a in [0,1] for b in [0,1]])
# ['XOR(0,0)=0', 'XOR(0,1)=1', 'XOR(1,0)=1', 'XOR(1,1)=0']


Solved Exercise 3.2 — Compute XOR on three bits of input. Let
XOR_3 : {0,1}^3 → {0,1} be the function defined as XOR_3(a, b, c) = a + b + c
mod 2. That is, XOR_3(a, b, c) = 1 if a + b + c is odd, and XOR_3(a, b, c) =
0 otherwise. Show that you can compute XOR_3 using AND, OR, and
NOT. You can express it as a formula, use a programming language
such as Python, or use a Boolean circuit.


Solution:

Addition modulo two satisfies the same properties of associativity
((a + b) + c = a + (b + c)) and commutativity (a + b = b + a) as
standard addition. This means that, if we define a ⊕ b to equal a + b
mod 2, then

XOR_3(a, b, c) = (a ⊕ b) ⊕ c

or in other words

XOR_3(a, b, c) = XOR(XOR(a, b), c).

Since we know how to compute XOR using AND, OR, and
NOT, we can compose this to compute XOR_3 using the same building
blocks. In Python this corresponds to the following program:


def XOR3(a,b,c):
    w1 = AND(a,b)
    w2 = NOT(w1)
    w3 = OR(a,b)
    w4 = AND(w2,w3)
    w5 = AND(w4,c)
    w6 = NOT(w5)
    w7 = OR(w4,c)
    return AND(w6,w7)


# Let's test this out
print([f"XOR3({a},{b},{c})={XOR3(a,b,c)}" for a in [0,1]
       for b in [0,1] for c in [0,1]])
# ['XOR3(0,0,0)=0', 'XOR3(0,0,1)=1', 'XOR3(0,1,0)=1',
#  'XOR3(0,1,1)=0', 'XOR3(1,0,0)=1', 'XOR3(1,0,1)=0',
#  'XOR3(1,1,0)=0', 'XOR3(1,1,1)=1']


3.2.3 Informally defining “basic operations” and “algorithms” 


We have seen that we can obtain at least some examples of interesting 
functions by composing together applications of AND, OR, and NOT. 
This suggests that we can use AND, OR, and NOT as our “basic opera- 
tions”, hence obtaining the following definition of an “algorithm”: 


Semi-formal definition of an algorithm: An algorithm consists of a
sequence of steps of the form “compute a new value by applying AND,
OR, or NOT to previously computed values”.

An algorithm A computes a function F if for every input x to F, if we
feed x as input to the algorithm, the value computed in its last step is
F(x).

There are several concerns that are raised by this definition: 


1. First and foremost, this definition is indeed too informal. We do not
specify exactly what each step does, nor what it means to “feed x as
input”.


2. Second, the choice of AND, OR or NOT seems rather arbitrary.
Why not XOR and MAJ? Why not allow operations like addition
and multiplication? What about any other logical constructions
such as if/then or while?


3. Third, do we even know that this definition has anything to do
with actual computing? If someone gave us a description of such an
algorithm, could we use it to actually compute the function in the
real world?


A large part of this book will be devoted to addressing the above
issues. We will see that:


1. We can make the definition of an algorithm fully formal, and so
give a precise mathematical meaning to statements such as “Algorithm
A computes function f”.


2. While the choice of AND/OR/NOT is arbitrary, and we could just
as well have chosen other functions, we will also see this choice
does not matter much. We will see that we would obtain the same
computational power if we instead used addition and multiplication,
and essentially every other operation that could be reasonably
thought of as a basic step.


3. It turns out that we can and do compute such “AND/OR/NOT-based
algorithms” in the real world. First of all, such an algorithm
is clearly well specified, and so can be executed by a human with a
pen and paper. Second, there are a variety of ways to mechanize this
computation. We’ve already seen that we can write Python code
that corresponds to following such a list of instructions. But in fact
we can directly implement operations such as AND, OR, and NOT
via electronic signals using components known as transistors. This is
how modern electronic computers operate.


In the remainder of this chapter, and the rest of this book, we will 
begin to answer some of these questions. We will see more examples 
of the power of simple operations to compute more complex opera- 
tions including addition, multiplication, sorting and more. We will 
also discuss how to physically implement simple operations such as 
AND, OR and NOT using a variety of technologies. 


3.3 BOOLEAN CIRCUITS 


Boolean circuits provide a precise notion of “composing basic
operations together”. A Boolean circuit (see Fig. 3.9) is composed of gates
and inputs that are connected by wires. The wires carry a signal that
represents either the value 0 or 1. Each gate corresponds to either the
OR, AND, or NOT operation. An OR gate has two incoming wires,
and one or more outgoing wires. If these two incoming wires carry
the signals a and b (for a,b ∈ {0,1}), then the signal on the outgoing
wires will be OR(a,b). AND and NOT gates are defined similarly. The
inputs have only outgoing wires. If we set a certain input to a value
a ∈ {0,1}, then this value is propagated on all the wires outgoing
from it. We also designate some gates as output gates, and their value
corresponds to the result of evaluating the circuit. For example, Fig. 3.8
gives such a circuit for the XOR function, following Section 3.2.2. We
evaluate an n-input Boolean circuit C on an input x ∈ {0,1}^n by
placing the bits of x on the inputs, and then propagating the values on the
wires until we reach an output, see Fig. 3.9.


Solved Exercise 3.3 — All equal function. Define ALLEQ : {0,1}^4 → {0,1}
to be the function that on input x ∈ {0,1}^4 outputs 1 if and only if
x_0 = x_1 = x_2 = x_3. Give a Boolean circuit for computing ALLEQ.
Figure 3.7: Standard symbols for the logical operations
or “gates” of AND, OR, NOT, as well as the operation
NAND discussed in Section 3.6.


Figure 3.8: A circuit with AND, OR and NOT gates for
computing the XOR function.


Figure 3.9: A Boolean Circuit consists of gates that are
connected by wires to one another and the inputs. The
left side depicts a circuit with 2 inputs and 5 gates,
one of which is designated the output gate. The right
side depicts the evaluation of this circuit on the input
x ∈ {0,1}^2 with x_0 = 1 and x_1 = 0. The value of
every gate is obtained by applying the corresponding
function (AND, OR, or NOT) to values on the wire(s)
that enter it. The output of the circuit on a given
input is the value of the output gate(s). In this case,
the circuit computes the XOR function and hence it
outputs 1 on the input 10.

Solution:

Another way to describe the function ALLEQ is that it outputs
1 on an input x ∈ {0,1}^4 if and only if x = 0^4 or x = 1^4. We can
phrase the condition x = 1^4 as x_0 ∧ x_1 ∧ x_2 ∧ x_3 which can be
computed using three AND gates. Similarly we can phrase the
condition x = 0^4 as ¬x_0 ∧ ¬x_1 ∧ ¬x_2 ∧ ¬x_3 which can be computed
using four NOT gates and three AND gates. The output of ALLEQ is
the OR of these two conditions, which results in the circuit of 4 NOT
gates, 6 AND gates, and one OR gate presented in Fig. 3.10.
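
The construction in this solution can be sketched in Python, using the same arithmetic definitions of AND, OR, and NOT as in Section 3.2.2. The function name ALLEQ and the intermediate variable names are illustrative choices, not taken from Fig. 3.10, and the grouping of the AND gates is one of several possible arrangements with the same gate count.

```python
from itertools import product

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)
def NOT(a): return 1 - a

def ALLEQ(x0, x1, x2, x3):
    # Condition x = 1^4: three AND gates
    all_ones = AND(AND(x0, x1), AND(x2, x3))
    # Condition x = 0^4: four NOT gates and three AND gates
    all_zeros = AND(AND(NOT(x0), NOT(x1)), AND(NOT(x2), NOT(x3)))
    # One OR gate combines the two conditions
    return OR(all_ones, all_zeros)

# Exhaustive check against the definition of ALLEQ on all 16 inputs
for x in product([0, 1], repeat=4):
    assert ALLEQ(*x) == (1 if len(set(x)) == 1 else 0)
```

The loop at the end checks all 16 inputs, which is feasible here only because the input length is fixed and small.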


3.3.1 Boolean circuits: a formal definition 

We defined Boolean circuits informally as obtained by connecting 
AND, OR, and NOT gates via wires so as to produce an output from 
an input. However, to be able to prove theorems about the existence or 
non-existence of Boolean circuits for computing various functions we 


need to:

1. Formally define a Boolean circuit as a mathematical object.

2. Formally define what it means for a circuit C to compute a function
f.

Figure 3.10: A Boolean circuit for computing the all
equal function ALLEQ : {0,1}^4 → {0,1} that outputs
1 on x ∈ {0,1}^4 if and only if x_0 = x_1 = x_2 = x_3.


We now proceed to do so. We will define a Boolean circuit as a 
labeled Directed Acyclic Graph (DAG). The vertices of the graph corre- 
spond to the gates and inputs of the circuit, and the edges of the graph 
correspond to the wires. A wire from an input or gate u to a gate v in 
the circuit corresponds to a directed edge between the corresponding 
vertices. The inputs are vertices with no incoming edges, while each 
gate has the appropriate number of incoming edges based on the func- 
tion it computes. (That is, AND and OR gates have two in-neighbors, 
while NOT gates have one in-neighbor.) The formal definition is as 
follows (see also Fig. 3.11): 



Definition 3.5 — Boolean Circuits. Let n,m,s be positive integers with
s ≥ m. A Boolean circuit with n inputs, m outputs, and s gates, is a
labeled directed acyclic graph (DAG) G = (V,E) with s+n vertices
satisfying the following properties:


• Exactly n of the vertices have no in-neighbors. These vertices
are known as inputs and are labeled with the n labels X[0], ...,
X[n-1]. Each input has at least one out-neighbor.


• The other s vertices are known as gates. Each gate is labeled with
∧, ∨ or ¬. Gates labeled with ∧ (AND) or ∨ (OR) have two
in-neighbors. Gates labeled with ¬ (NOT) have one in-neighbor.
We will allow parallel edges.¹


• Exactly m of the gates are also labeled with the m labels Y[0], ...,
Y[m-1] (in addition to their label ∧/∨/¬). These are known as
outputs.


The size of a Boolean circuit is the number s of gates it contains.
Figure 3.11: A Boolean Circuit is a labeled directed
acyclic graph (DAG). It has n input vertices, which are
marked with X[0], ..., X[n-1] and have no incoming
edges, and the rest of the vertices are gates. AND,
OR, and NOT gates have two, two, and one incoming
edges, respectively. If the circuit has m outputs, then
m of the gates are known as outputs and are marked
with Y[0], ..., Y[m-1]. When we evaluate a circuit
C on an input x ∈ {0,1}^n, we start by setting the
value of the input vertices to x_0, ..., x_{n-1} and then
propagate the values, assigning to each gate g the
result of applying the operation of g to the values of
g's in-neighbors. The output of the circuit is the value
assigned to the output gates.


¹ Having parallel edges means an AND or OR gate
u can have both its in-neighbors be the same gate
v. Since AND(a,a) = OR(a,a) = a for every a ∈
{0,1}, such parallel edges don’t help in computing
new values in circuits with AND/OR/NOT gates.
However, we will see circuits with more general sets
of gates later on.
If C is a circuit with n inputs and m outputs, and x ∈ {0,1}^n, then
we can compute the output of C on the input x in the natural way:
assign the input vertices X[0], ..., X[n-1] the values x_0, ..., x_{n-1},
apply each gate on the values of its in-neighbors, and then output the
values that correspond to the output vertices. Formally, this is defined
as follows:


Definition 3.6 — Computing a function via a Boolean circuit. Let C be a
Boolean circuit with n inputs and m outputs. For every x ∈ {0,1}^n,
the output of C on the input x, denoted by C(x), is defined as the
result of the following process:

We let h : V → ℕ be the minimal layering of C (aka topological
sorting, see Theorem 1.26). We let L be the maximum layer of h,
and for ℓ = 0, 1, ..., L we do the following:


• For every v in the ℓ-th layer (i.e., v such that h(v) = ℓ) do:


- If v is an input vertex labeled with X[i] for some i ∈ [n], then
we assign to v the value x_i.


- If v is a gate vertex labeled with ∧ and with two in-neighbors
u, w then we assign to v the AND of the values assigned to
u and w. (Since u and w are in-neighbors of v, they are in a
lower layer than v, and hence their values have already been
assigned.)


- If v is a gate vertex labeled with ∨ and with two in-neighbors
u, w then we assign to v the OR of the values assigned to u
and w.


- If v is a gate vertex labeled with ¬ and with one in-neighbor u
then we assign to v the negation of the value assigned to u.


• The result of this process is the value y ∈ {0,1}^m such that for
every j ∈ [m], y_j is the value assigned to the vertex with label
Y[j].


Let f : {0,1}^n → {0,1}^m. We say that the circuit C computes f if
for every x ∈ {0,1}^n, C(x) = f(x).
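
The layer-by-layer process of Definition 3.6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the dictionary representation of a circuit, the gate names g0, ..., g3, and the function names minimal_layering and evaluate are hypothetical choices made for this example and are not part of the formal definition.

```python
from itertools import product

def minimal_layering(gates, n):
    """Map each vertex to its layer: inputs sit in layer 0 and every gate
    sits one layer above the highest of its in-neighbors."""
    h = {f"X[{i}]": 0 for i in range(n)}

    def layer(v):
        if v not in h:
            h[v] = 1 + max(layer(u) for u in gates[v][1])
        return h[v]

    for v in gates:
        layer(v)
    return h

def evaluate(gates, outputs, x):
    """Evaluate the circuit on the input bits x, one layer at a time."""
    h = minimal_layering(gates, len(x))
    value = {}
    for ell in range(max(h.values()) + 1):
        for v in [u for u in h if h[u] == ell]:
            if v not in gates:               # input vertex labeled X[i]
                value[v] = x[int(v[2:-1])]
                continue
            op, ins = gates[v]
            a = [value[u] for u in ins]      # in-neighbors lie in lower layers
            if op == "AND":
                value[v] = a[0] & a[1]
            elif op == "OR":
                value[v] = a[0] | a[1]
            else:                            # "NOT"
                value[v] = 1 - a[0]
    return [value[g] for g in outputs]       # outputs[j] is the gate labeled Y[j]

# The XOR circuit of Section 3.2.2: gate name -> (label, in-neighbors)
xor_gates = {
    "g0": ("AND", ["X[0]", "X[1]"]),
    "g1": ("NOT", ["g0"]),
    "g2": ("OR",  ["X[0]", "X[1]"]),
    "g3": ("AND", ["g1", "g2"]),
}
xor_outputs = ["g3"]

for a, b in product([0, 1], repeat=2):
    assert evaluate(xor_gates, xor_outputs, [a, b]) == [(a + b) % 2]
```

The sketch assumes the graph is acyclic, as Definition 3.5 requires; on a cyclic input the recursive layering would not terminate.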


3.4 STRAIGHT-LINE PROGRAMS 


We have seen two ways to describe how to compute a function f using 
AND, OR and NOT: 


• A Boolean circuit, defined in Definition 3.5, computes f by
connecting via wires AND, OR, and NOT gates to the inputs.


• We can also describe such a computation using a straight-line
program that has lines of the form foo = AND(bar,blah), foo =
OR(bar,blah) and foo = NOT(bar) where foo, bar and blah are
variable names. (We call this a straight-line program since it contains
no loops or branching (e.g., if/then) statements.)


To make the second definition more precise, we will now define a 
programming language that is equivalent to Boolean circuits. We call 
this programming language the AON-CIRC programming language 
(“AON” stands for AND/OR/NOT; “CIRC” stands for circuit). 

For example, the following is an AON-CIRC program that on input
x ∈ {0,1}^2, outputs ¬(x_0 ∧ x_1) (i.e., the NOT operation applied to
AND(x_0, x_1)):


temp = AND(X[0],X[1])
Y[0] = NOT(temp)


AON-CIRC is not a practical programming language: it was de- 
signed for pedagogical purposes only, as a way to model computation 
as the composition of AND, OR, and NOT. However, it can still be 
easily implemented on a computer. 

Given this example, you might already be able to guess how to
write a program for computing (for example) x_0 ∧ x_1 ∨ x_2, and in
general how to translate a Boolean circuit into an AON-CIRC program.
However, since we will want to prove mathematical statements about
AON-CIRC programs, we will need to precisely define the AON-CIRC
programming language. Precise specifications of programming
languages can sometimes be long and tedious,² but are crucial for secure
and reliable implementations. Luckily, the AON-CIRC programming
language is simple enough that we can define it formally with
relatively little pain.


3.4.1 Specification of the AON-CIRC programming language 
An AON-CIRC program is a sequence of strings, which we call 
“lines”, satisfying the following conditions: 


• Every line has one of the following forms: foo = AND(bar,baz),
foo = OR(bar,baz), or foo = NOT(bar) where foo, bar and baz
are variable identifiers. (We follow the common programming
languages convention of using names such as foo, bar, baz as
stand-ins for generic identifiers.) The line foo = AND(bar,baz)
corresponds to the operation of assigning to the variable foo the logical
AND of the values of the variables bar and baz. Similarly foo =
OR(bar,baz) and foo = NOT(bar) correspond to the logical OR
and logical NOT operations.


• A variable identifier in the AON-CIRC programming language can
be any combination of letters, numbers, underscores, and brackets.
There are two special types of variables:


- Variables of the form X[i], with i ∈ {0,1,...,n-1} are known as
input variables.
- Variables of the form Y[j] are known as output variables.


• A valid AON-CIRC program P includes input variables of the form
X[0], ..., X[n-1] and output variables of the form Y[0], ..., Y[m-1]
where n,m are natural numbers. We say that n is the number of
inputs of the program P and m is the number of outputs.


• In a valid AON-CIRC program, in every line the variables on the
right-hand side of the assignment operator must either be input
variables or variables that have already been assigned a value in a
previous line.


• If P is a valid AON-CIRC program of n inputs and m outputs,
then for every x ∈ {0,1}^n the output of P on input x is the string
y ∈ {0,1}^m defined as follows:


- Initialize the input variables X[0], ..., X[n-1] to the values
x_0, ..., x_{n-1}.

- Run the operator lines of P one by one in order, in each line
assigning to the variable on the left-hand side of the assignment
operator the value of the operation on the right-hand side.

- Let y ∈ {0,1}^m be the values of the output variables Y[0], ...,
Y[m-1] at the end of the execution.


• We denote the output of P on input x by P(x).


• The size of an AON-CIRC program P is the number of lines it
contains. (The reader might note that this corresponds to our
definition of the size of a circuit as the number of gates it contains.)


² For example the C programming language specification
takes more than 500 pages.


Now that we formally specified AON-CIRC programs, we can 
define what it means for an AON-CIRC program P to compute a 
function f: 


Definition 3.8 — Computing a function via AON-CIRC programs. Let
f : {0,1}^n → {0,1}^m, and P be a valid AON-CIRC program with n
inputs and m outputs. We say that P computes f if P(x) = f(x) for
every x ∈ {0,1}^n.
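
As the text notes, AON-CIRC can be easily implemented on a computer. The following Python sketch interprets a program according to the specification above and checks Definition 3.8 by brute force over all 2^n inputs. The names run_aon_circ and computes, the regular expression, and the treatment of `#` as starting a comment are assumptions made for this illustration; they are not part of the formal specification.

```python
import re
from itertools import product

# One operator line: foo = AND(bar,baz), foo = OR(bar,baz) or foo = NOT(bar)
LINE = re.compile(
    r"^\s*([\w\[\]]+)\s*=\s*(AND|OR|NOT)"
    r"\(\s*([\w\[\]]+)\s*(?:,\s*([\w\[\]]+)\s*)?\)\s*$"
)

def run_aon_circ(program, x):
    """Return P(x) as a list of bits, for a program given as a string."""
    env = {f"X[{i}]": bit for i, bit in enumerate(x)}   # initialize the inputs
    for raw in program.splitlines():
        line = raw.split("#")[0].strip()   # assumption: '#' starts a comment
        if not line:
            continue
        match = LINE.match(line)
        if match is None:
            raise SyntaxError(f"not a valid AON-CIRC line: {raw!r}")
        target, op, a, b = match.groups()
        if (op == "NOT") != (b is None):
            raise SyntaxError(f"wrong number of arguments: {raw!r}")
        # Looking up env[a] fails if a was never assigned, which mirrors the
        # rule that right-hand side variables must already have a value.
        if op == "NOT":
            env[target] = 1 - env[a]
        elif op == "AND":
            env[target] = env[a] & env[b]
        else:
            env[target] = env[a] | env[b]
    m = sum(1 for name in env if name.startswith("Y["))
    return [env[f"Y[{j}]"] for j in range(m)]

def computes(program, f, n):
    """Check P(x) = f(x) for every x in {0,1}^n (exponential in n)."""
    return all(run_aon_circ(program, x) == f(x)
               for x in product([0, 1], repeat=n))

nand_program = """
temp = AND(X[0],X[1])
Y[0] = NOT(temp)
"""
print(computes(nand_program, lambda x: [1 - x[0] * x[1]], 2))
```

The brute-force check enumerates every input, so it is practical only for small n; it is meant to make the definition concrete rather than to serve as a verification tool.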


The following solved exercise gives an example of an AON-CIRC 
program. 


Solved Exercise 3.4 Consider the following function CMP : {0,1}^4 →
{0,1} that on four input bits a,b,c,d ∈ {0,1}, outputs 1 iff the number
represented by (a,b) is larger than the number represented by (c,d).
That is CMP(a,b,c,d) = 1 iff 2a + b > 2c + d.

Write an AON-CIRC program to compute CMP. 


Solution: 

Writing such a program is tedious but not truly hard. To com- 
pare two numbers we first compare their most significant digit, 
and then go down to the next digit and so on and so forth. In this 
case where the numbers have just two binary digits, these compar- 
isons are particularly simple. The number represented by (a,b) is 
larger than the number represented by (c, d) if and only if one of 
the following conditions happens: 


1. The most significant bit a of (a, b) is larger than the most signifi- 
cant bit c of (c, d). 


or 
2. The two most significant bits a and c are equal, but b > d. 


Another way to express the same condition is the following: the
number (a,b) is larger than (c,d) iff a > c OR (a ≥ c AND b > d).
For binary digits α, β, the condition α > β is simply that α = 1
and β = 0, or AND(α, NOT(β)) = 1, and the condition α ≥ β is
simply OR(α, NOT(β)) = 1. Together these observations can be used to
give the following AON-CIRC program to compute CMP:


# Compute CMP: {0,1}^4 --> {0,1}
# CMP(X)=1 iff 2X[0]+X[1] > 2X[2] + X[3]
temp_1 = NOT(X[2])
temp_2 = AND(X[0],temp_1)
temp_3 = OR(X[0],temp_1)
temp_4 = NOT(X[3])
temp_5 = AND(X[1],temp_4)
temp_6 = AND(temp_5,temp_3)
Y[0] = OR(temp_2,temp_6)


We can also present this program, which has 7 operator lines, as a
circuit with 7 gates, see Fig. 3.12.
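
One way to gain confidence in such a program is to transliterate it into Python and compare it with the defining inequality on all 16 inputs. The sketch below does this; the function name CMP and the inline comments are illustrative additions, while the seven assignments mirror the operator lines of the AON-CIRC program above.

```python
from itertools import product

def AND(a, b): return a * b
def OR(a, b): return 1 - (1 - a) * (1 - b)
def NOT(a): return 1 - a

def CMP(X):
    temp_1 = NOT(X[2])
    temp_2 = AND(X[0], temp_1)     # a > c
    temp_3 = OR(X[0], temp_1)      # a >= c
    temp_4 = NOT(X[3])
    temp_5 = AND(X[1], temp_4)     # b > d
    temp_6 = AND(temp_5, temp_3)   # a >= c AND b > d
    return OR(temp_2, temp_6)      # a > c OR (a >= c AND b > d)

# Compare with the definition CMP(a,b,c,d) = 1 iff 2a + b > 2c + d
for a, b, c, d in product([0, 1], repeat=4):
    assert CMP([a, b, c, d]) == int(2 * a + b > 2 * c + d)
```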


Figure 3.12: A circuit for computing the CMP function.
The evaluation of this circuit on (1,1,1,0) yields the
output 1, since the number 3 (represented in binary
as 11) is larger than the number 2 (represented in
binary as 10).


3.4.2 Proving equivalence of AON-CIRC programs and Boolean circuits
We now formally prove that AON-CIRC programs and Boolean
circuits have exactly the same power:

Theorem 3.9 — Equivalence of circuits and straight-line programs. Let
f : {0,1}^n → {0,1}^m and s ≥ m be some number. Then f is
computable by a Boolean circuit with s gates if and only if f is
computable by an AON-CIRC program of s lines.


Proof Idea:


The idea is simple - AON-CIRC programs and Boolean circuits 
are just different ways of describing the exact same computational 
process. For example, an AND gate in a Boolean circuit corresponds to 
computing the AND of two previously-computed values. In an AON- 
CIRC program this will correspond to the line that stores in a variable 
the AND of two previously-computed variables. 


* 
Proof of Theorem 3.9. Let f : {0,1}^n → {0,1}^m. Since the theorem is an
“if and only if” statement, to prove it we need to show both directions:
translating an AON-CIRC program that computes f into a circuit that
computes f, and translating a circuit that computes f into an
AON-CIRC program that does so.

We start with the first direction. Let P be an AON-CIRC program
that computes f. We define a circuit C as follows: the circuit will
have n inputs and s gates. For every i ∈ [s], if the i-th operator line
has the form foo = AND(bar,blah) then the i-th gate in the circuit
will be an AND gate that is connected to gates j and k where j and
k correspond to the last lines before i where the variables bar and
blah (respectively) were written to. (For example, if i = 57 and the
last line bar was written to is 35 and the last line blah was written
to is 17 then the two in-neighbors of gate 57 will be gates 35 and 17.)
If either bar or blah is an input variable then we connect the gate to
the corresponding input vertex instead. If foo is an output variable
of the form Y[j] then we add the same label to the corresponding
gate to mark it as an output gate. We do the analogous operations if
the i-th line involves an OR or a NOT operation (except that we use the
corresponding OR or NOT gate, and in the latter case have only one
in-neighbor instead of two). For every input x ∈ {0,1}^n, if we run
the program P on x, then the value computed in the
i-th line is exactly the value that will be assigned to the i-th gate if we
evaluate the circuit C on x. Hence C(x) = P(x) for every x ∈ {0,1}^n.

For the other direction, let C be a circuit of s gates and n inputs that
computes the function f. We sort the gates according to a topological
order and write them as v_0, ..., v_{s-1}. We now can create a program
P of s operator lines as follows. For every i ∈ [s], if v_i is an AND
gate with in-neighbors v_j, v_k then we will add a line to P of the form
temp_i = AND(temp_j,temp_k), unless one of the vertices is an input
vertex or an output gate, in which case we change this to the form
X[.] or Y[.] appropriately. Because we work in topological
ordering, we are guaranteed that the in-neighbors v_j and v_k correspond to
variables that have already been assigned a value. We do the same for
OR and NOT gates. Once again, one can verify that for every input x,
the value P(x) will equal C(x) and hence the program computes the
same function as the circuit. (Note that since C is a valid circuit, per
Definition 3.5, every input vertex of C has at least one out-neighbor
and there are exactly m output gates labeled 0, ..., m-1; hence all the
variables X[0], ..., X[n-1] and Y[0], ..., Y[m-1] will appear in the
program P.)

∎


Figure 3.13: Two equivalent descriptions of the same
AND/OR/NOT computation as both an AON program
and a Boolean circuit.
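
The first direction of the proof (from program to circuit) is mechanical enough to sketch in Python. In this illustration a program is given as a list of (target, operation, arguments) triples, one per operator line; that representation, the function names program_to_circuit and evaluate_circuit, and the example program (an AON-CIRC rendering of the XOR_3 solution from Solved Exercise 3.2) are assumptions made for the sketch. The i-th line becomes the i-th gate, and each argument is wired to the last line that wrote it, or to an input vertex.

```python
from itertools import product

def program_to_circuit(lines):
    """Translate operator lines into gates. gates[i] = (op, in_neighbors),
    where an in-neighbor is an input label 'X[i]' or an earlier gate index."""
    last_written = {}          # variable name -> index of last line writing it
    gates, outputs = [], {}
    for i, (target, op, args) in enumerate(lines):
        ins = []
        for var in args:
            if var in last_written:
                ins.append(last_written[var])
            elif var.startswith("X["):
                ins.append(var)
            else:
                raise NameError(f"{var} is used before being assigned")
        gates.append((op, ins))
        last_written[target] = i
        if target.startswith("Y["):
            outputs[target] = i    # mark gate i as the output gate Y[j]
    return gates, outputs

def evaluate_circuit(gates, outputs, x):
    """Evaluate gates in index order, which is already a topological order."""
    value = []
    for op, ins in gates:
        a = [x[int(u[2:-1])] if isinstance(u, str) else value[u] for u in ins]
        if op == "AND":
            value.append(a[0] & a[1])
        elif op == "OR":
            value.append(a[0] | a[1])
        else:                      # "NOT"
            value.append(1 - a[0])
    return [value[outputs[f"Y[{j}]"]] for j in range(len(outputs))]

xor3_lines = [
    ("temp_1", "AND", ["X[0]", "X[1]"]),
    ("temp_2", "NOT", ["temp_1"]),
    ("temp_3", "OR",  ["X[0]", "X[1]"]),
    ("temp_4", "AND", ["temp_2", "temp_3"]),
    ("temp_5", "AND", ["temp_4", "X[2]"]),
    ("temp_6", "NOT", ["temp_5"]),
    ("temp_7", "OR",  ["temp_4", "X[2]"]),
    ("Y[0]",   "AND", ["temp_6", "temp_7"]),
]
gates, outputs = program_to_circuit(xor3_lines)
assert len(gates) == len(xor3_lines)        # s lines give s gates
for x in product([0, 1], repeat=3):
    assert evaluate_circuit(gates, outputs, x) == [sum(x) % 2]
```

The final assertions illustrate the two claims of the proof for this one example: the circuit has as many gates as the program has lines, and the two agree on every input.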
3.5 PHYSICAL IMPLEMENTATIONS OF COMPUTING DEVICES (DIGRESSION)


Computation is an abstract notion that is distinct from its physical im- 
plementations. While most modern computing devices are obtained by 
mapping logical gates to semiconductor-based transistors, throughout 
history people have computed using a huge variety of mechanisms, 
including mechanical systems, gas and liquid (known as fluidics), bi- 
ological and chemical processes, and even living creatures (e.g., see 
Fig. 3.14 or this video for how crabs or slime mold can be used to do 
computations). 

In this section we will review some of these implementations, both
so that you can appreciate how it is possible to directly translate
Boolean circuits to the physical world, without going through the
entire stack of architecture, operating systems, and compilers, and
to emphasize that silicon-based processors are by no means the only
way to perform computation. Indeed, as we will see in Chapter 23,
a very exciting recent line of work involves using different media for
computation that would allow us to take advantage of quantum
mechanical effects to enable different types of algorithms.

Such a cool way to explain logic gates. pic.twitter.com/6Wgu2ZKFCx 

— Lionel Page (@page_eco) October 28, 2019 


3.5.1 Transistors 

A transistor can be thought of as an electric circuit with two inputs, 
known as the source and the gate and an output, known as the sink. 

The gate controls whether current flows from the source to the sink. In 
a standard transistor, if the gate is “ON” then current can flow from the 
source to the sink and if it is “OFF” then it can’t. In a complementary 
transistor this is reversed: if the gate is “OFF” then current can flow 
from the source to the sink and if it is “ON” then it can’t. 

There are several ways to implement the logic of a transistor. For
example, we can use faucets to implement it using water pressure
(e.g. Fig. 3.15). This might seem to be merely a curiosity, but there is
a field known as fluidics concerned with implementing logical
operations using liquids or gases. Some of the motivations include
operating in extreme environmental conditions such as in space or a
battlefield, where standard electronic equipment would not survive.

The standard implementations of transistors use electrical current. 
One of the original implementations used vacuum tubes. As its name 
implies, a vacuum tube is a tube containing nothing (i.e., a vacuum) 
and where a priori electrons could freely flow from the source (a 
wire) to the sink (a plate). However, there is a gate (a grid) between 
the two, where modulating its voltage can block the flow of electrons. 


Figure 3.14: Crab-based logic gates from the paper 
“Robust soldier-crab ball gate” by Gunji, Nishiyama 
and Adamatzky. This is an example of an AND gate 
that relies on the tendency of two swarms of crabs 
arriving from different directions to combine to a 
single swarm that continues in the average of the 
directions. 


Figure 3.15: We can implement the logic of transistors 
using water. The water pressure from the gate closes 
or opens a faucet between the source and the sink. 


Early vacuum tubes were roughly the size of lightbulbs (and
looked very much like them too). In the 1950s they were supplanted
by transistors, which implement the same logic using semiconductors:
materials that normally do not conduct electricity but
whose conductivity can be modified and controlled by inserting
impurities (“doping”) and applying an external electric field (this is known
as the field effect). In the 1960s computers started to be implemented
using integrated circuits, which enabled much greater density. In 1965,
Gordon Moore predicted that the number of transistors per integrated 
circuit would double every year (see Fig. 3.16), and that this would 
lead to “such wonders as home computers —or at least terminals con- 
nected to a central computer— automatic controls for automobiles, 
and personal portable communications equipment”. Since then, (ad- 
justed versions of) this so-called “Moore’s law” have been running 
strong, though exponential growth cannot be sustained forever, and 
some physical limitations are already becoming apparent. 


3.5.2 Logical gates from transistors 

We can use transistors to implement various Boolean functions such as
AND, OR, and NOT. For a two-input gate G : {0,1}^2 → {0,1},
such an implementation would be a system with two input wires x, y
and one output wire z, such that if we identify high voltage with “1”
and low voltage with “0”, then the wire z will be equal to “1” if and
only if applying G to the values of the wires x and y is 1 (see Fig. 3.19
and Fig. 3.20). This means that if there exists an AND/OR/NOT circuit
to compute a function g : {0,1}^n → {0,1}, then we can compute g in
the physical world using transistors as well.
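
To make this concrete, here is a highly simplified Python model of building gates out of the standard and complementary transistors of Section 3.5.1. It is a sketch under stated assumptions: the wiring below follows one common arrangement (two complementary transistors in parallel towards high voltage, two standard transistors in series towards low voltage) and is not claimed to match the layouts in Fig. 3.19 or Fig. 3.20. The function names and the use of None for "no current" are illustrative choices.

```python
from itertools import product

HIGH, LOW = 1, 0

def standard(gate, signal):
    """Standard transistor: passes signal from source to sink only if the gate is ON."""
    return signal if gate == 1 else None      # None models "no current"

def complementary(gate, signal):
    """Complementary transistor: passes signal only if the gate is OFF."""
    return signal if gate == 0 else None

def NAND(x, y):
    # Two complementary transistors in parallel connect the output to HIGH
    pull_up = [complementary(x, HIGH), complementary(y, HIGH)]
    # Two standard transistors in series connect the output to LOW
    pull_down = standard(y, standard(x, LOW))
    if any(s is not None for s in pull_up):
        return HIGH
    assert pull_down is not None   # exactly one of the two paths conducts
    return pull_down

# AND, OR and NOT obtained by composing the transistor-level NAND
def NOT(a): return NAND(a, a)
def AND(a, b): return NOT(NAND(a, b))
def OR(a, b): return NAND(NOT(a), NOT(b))

for a, b in product([0, 1], repeat=2):
    assert NAND(a, b) == 1 - a * b
    assert AND(a, b) == a * b
    assert OR(a, b) == max(a, b)
    assert NOT(a) == 1 - a
```

The model ignores all electrical detail (voltages, timing, leakage) and captures only the switching logic described in the text.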


3.5.3 Biological computing 

Computation can be based on biological or chemical systems. For
example, the lac operon produces the enzymes needed to digest lactose
only if the condition x ∧ (¬y) holds, where x is “lactose is present”
and y is “glucose is present”. Researchers have managed to create
transistors, and from them logic gates, based on DNA molecules (see 
also Fig. 3.21). Projects such as the Cello programming language 
enable converting Boolean circuits into DNA sequences that encode 
operations that can be executed in bacterial cells, see this video. One 
motivation for DNA computing is to achieve increased parallelism 

or storage density; another is to create “smart biological agents” that 
could perhaps be injected into bodies, replicate themselves, and fix or 
kill cells that were damaged by a disease such as cancer. Computing 
in biological systems is not restricted, of course, to DNA: even larger 
systems such as flocks of birds can be considered as computational 
processes. 
Figure 3.16: The number of transistors per integrated
circuit from 1959 till 1965 and a prediction that
exponential growth will continue for at least another
decade. Figure taken from “Cramming More Components
onto Integrated Circuits”, Gordon Moore, 1965.


Figure 3.17: Cartoon from Gordon Moore’s article
“predicting” the implications of radically improving
transistor density.


Figure 3.18: The exponential growth in computing
power over the last 120 years (“120 Years of Moore’s
Law”). Graph by Steve Jurvetson, extending a prior
graph of Ray Kurzweil.


Figure 3.19: Implementing logical gates using transistors.
Figure taken from Rory Mangles’ website.
3.5.4 Cellular automata and the game of life

A cellular automaton is a model of a system composed of a sequence of
cells, each of which can be in one of a finite number of states. At each
step, a cell updates its state based on the states of its neighboring
cells and some simple rules. As we will discuss later in this book (see
Section 8.4), cellular automata such as Conway’s “Game of Life” can
be used to simulate computation gates.


3.5.5 Neural networks 

One computation device that we all carry with us is our own brain.
Brains have served humanity throughout history, doing computations
that range from distinguishing prey from predators, through making
scientific discoveries and artistic masterpieces, to composing witty
280-character messages. The exact working of the brain is still not fully
understood, but one common mathematical model for it is a (very
large) neural network.

A neural network can be thought of as a Boolean circuit that instead
of AND/OR/NOT uses some other gates as its basis. For example,
one particular basis we can use is threshold gates. For every vector
w = (w_0, ..., w_{k-1}) of integers and integer t (some or all of which
could be negative), the threshold function corresponding to w,t is the
function T_{w,t} : {0,1}^k → {0,1} that maps x ∈ {0,1}^k to 1 if and only if
Σ_{i=0}^{k-1} w_i x_i ≥ t. For example, the threshold function T_{w,t}
corresponding to w = (1,1,1,1,1) and t = 3 is simply the majority function MAJ_5
on {0,1}^5. Threshold gates can be thought of as an approximation for
neuron cells that make up the core of human and animal brains. To a
first approximation, a neuron has k inputs and a single output, and
the neuron “fires” or “turns on” its output when those signals pass
some threshold.
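
The definition of a threshold function translates directly into a few lines of Python. The sketch below builds T_{w,t} from w and t and checks the MAJ_5 example from the text; the helper name threshold and the three extra gates at the end (threshold versions of AND, OR, and NOT on bits) are illustrative additions rather than examples taken from the text.

```python
from itertools import product

def threshold(w, t):
    """Return T_{w,t}: maps x in {0,1}^k to 1 iff sum_i w_i * x_i >= t."""
    def T(x):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= t else 0
    return T

# w = (1,1,1,1,1) and t = 3 gives the majority function on 5 bits
MAJ5 = threshold((1, 1, 1, 1, 1), 3)
for x in product([0, 1], repeat=5):
    assert MAJ5(x) == (1 if sum(x) >= 3 else 0)

# Illustrative extras: AND, OR and NOT are themselves threshold functions
AND2 = threshold((1, 1), 2)
OR2 = threshold((1, 1), 1)
NOT1 = threshold((-1,), 0)     # -x >= 0 holds exactly when x = 0
for a, b in product([0, 1], repeat=2):
    assert AND2((a, b)) == a * b
    assert OR2((a, b)) == max(a, b)
    assert NOT1((a,)) == 1 - a
```

The NOT example shows why negative weights are allowed in the definition.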

Many machine learning algorithms use artificial neural networks
whose purpose is not to imitate biology but rather to perform some
computational tasks, and hence are not restricted to a threshold or
other biologically-inspired gates. Generally, a neural network is often
described as operating on signals that are real numbers, rather than
0/1 values, and where the output of a gate on inputs x_0, ..., x_{k-1} is
obtained by applying f(Σ_i w_i x_i) where f : ℝ → ℝ is an activation
function such as rectified linear unit (ReLU), Sigmoid, or many others
(see Fig. 3.23). However, for the purposes of our discussion, all of
the above are equivalent (see also Exercise 3.13). In particular we can
reduce the setting of real inputs to binary inputs by representing a
real number in the binary basis, and multiplying the weight of the bit
corresponding to the i-th digit by 2^i.


Figure 3.20: Implementing a NAND gate (see Section 3.6) using transistors.

Figure 3.21: Performance of DNA-based logic gates.
Figure taken from paper of Bonnet et al, Science, 2013.

Figure 3.22: An AND gate using a “Game of Life”
configuration. Figure taken from Jean-Philippe
Rennard’s paper.


3.5.6 A computer made from marbles and pipes 

We can implement computation using many other physical media, 
without any electronic, biological, or chemical components. Many 
suggestions for mechanical computers have been put forward, going 
back at least to Gottfried Leibniz’s computing machines from the 
1670s and Charles Babbage’s 1837 plan for a mechanical “Analytical 
Engine”. As one example, Fig. 3.24 shows a simple implementation of 
a NAND (negation of AND, see Section 3.6) gate using marbles going 
through pipes. We represent a logical value in {0, 1} by a pair of pipes, 
such that there is a marble flowing through exactly one of the pipes. 
We call one of the pipes the “0 pipe” and the other the “1 pipe”, and 
so the identity of the pipe containing the marble determines the logi- 
cal value. A NAND gate corresponds to a mechanical object with two 
pairs of incoming pipes and one pair of outgoing pipes, such that for 
every a,b € {0,1}, if two marbles are rolling toward the object in the 
a pipe of the first pair and the b pipe of the second pair, then a marble 
will roll out of the object in the NAND(a, b)-pipe of the outgoing pair. 
In fact, there is even a commercially-available educational game that 
uses marbles as a basis of computing, see Fig. 3.26. 
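
The pair-of-pipes encoding can be mimicked in a few lines of Python. This is only a sketch of the logical behavior described above, not of the mechanics: each value is a pair (marble in the 0 pipe, marble in the 1 pipe), and the names `encode`, `decode`, and `marble_nand` are hypothetical helpers introduced for illustration.

```python
def encode(bit):
    """Dual-pipe encoding: (marble in the 0 pipe, marble in the 1 pipe)."""
    return (1, 0) if bit == 0 else (0, 1)

def decode(pipes):
    """Read back the logical value; exactly one pipe must hold a marble."""
    if sorted(pipes) != [0, 1]:
        raise ValueError("exactly one of the two pipes must carry a marble")
    return pipes.index(1)

def marble_nand(first, second):
    a0, a1 = first
    b0, b1 = second
    out1 = 1 if (a0 or b0) else 0    # a marble in either 0 pipe exits via the 1 pipe
    out0 = 1 if (a1 and b1) else 0   # marbles in both 1 pipes: one exits via the 0 pipe
    return (out0, out1)

table = {(a, b): decode(marble_nand(encode(a), encode(b)))
         for a in (0, 1) for b in (0, 1)}
print(table)
```

For every pair of valid inputs exactly one output pipe carries a marble, so the output is again a valid encoding and gates can be chained.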


3.6 THE NAND FUNCTION 


The NAND function is another simple function that is extremely useful
for defining computation. It is the function mapping $\{0,1\}^2$ to
$\{0,1\}$ defined by:

$$\mathrm{NAND}(a,b) = \begin{cases} 0 & a = b = 1 \\ 1 & \text{otherwise} \end{cases}$$

As its name implies, NAND is the NOT of AND (i.e., NAND(a, b) = 
NOT(AND(a, b))), and so we can clearly compute NAND using AND 
and NOT. Interestingly, the opposite direction holds as well: 


Theorem 3.10 — NAND computes AND,OR,NOT. We can compute AND, 
OR, and NOT by composing only the NAND function. 


Proof. We start with the following observation. For every $a \in \{0,1\}$,
AND(a, a) = a. Hence, NAND(a, a) = NOT(AND(a, a)) = NOT(a).
This means that NAND can compute NOT. By the principle of “double
negation”, AND(a, b) = NOT(NOT(AND(a, b))) = NOT(NAND(a, b)), and hence
we can use NAND to compute AND as well. Once we can compute
AND and NOT, we can compute OR using “De Morgan’s Law”:
OR(a, b) = NOT(AND(NOT(a), NOT(b))) (which can also be written
as $a \vee b = \overline{\overline{a} \wedge \overline{b}}$) for every $a,b \in \{0,1\}$.

Figure 3.23: Common activation functions used in
Neural Networks, including rectified linear units
(ReLU), sigmoids, and hyperbolic tangent. All of
those can be thought of as continuous approximations
to the simple step function. All of these can be used
to compute the NAND gate (see Exercise 3.13). This
property enables neural networks to (approximately)
compute any function that can be computed by a
Boolean circuit.
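
The three identities used in the proof of Theorem 3.10 are easy to check exhaustively. The following Python sketch defines NOT, AND, and OR using nothing but calls to a `NAND` function (written here arithmetically as 1 - a*b, which is our own shorthand) and compares them with Python's built-in bit operations.

```python
from itertools import product

def NAND(a, b):
    return 1 - a * b

def NOT(a):
    return NAND(a, a)                 # NAND(a, a) = NOT(AND(a, a)) = NOT(a)

def AND(a, b):
    return NOT(NAND(a, b))            # double negation

def OR(a, b):
    return NAND(NOT(a), NOT(b))       # De Morgan's law

identities_hold = all(
    NOT(a) == 1 - a and AND(a, b) == (a & b) and OR(a, b) == (a | b)
    for a, b in product((0, 1), repeat=2)
)
print(identities_hold)
```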


Figure 3.24: A physical implementation of a NAND
gate using marbles. Each wire in a Boolean circuit is
modeled by a pair of pipes representing the values
0 and 1 respectively, and hence a gate has four input
pipes (two for each logical input) and two output
pipes. If one of the input pipes representing the value
0 has a marble in it then that marble will flow to the
output pipe representing the value 1. (The dashed
line represents a gadget that will ensure that at most
one marble is allowed to flow onward in the pipe.)
If both the input pipes representing the value 1 have
marbles in them, then the first marble will be stuck
but the second one will flow onwards to the output
pipe representing the value 0.


Figure 3.25: A “gadget” in a pipe that ensures that at
most one marble can pass through it. The first marble
that passes causes the barrier to lift and block new
ones.

Figure 3.26: The game “Turing Tumble” contains an
implementation of logical gates using marbles.

We can use NAND to compute many other functions, as demonstrated
in the following exercise.

Solved Exercise 3.5 — Compute majority with NAND. Let $MAJ : \{0,1\}^3 \to \{0,1\}$
be the function that on input $a,b,c$ outputs 1 iff $a + b + c \geq 2$.
Show how to compute MAJ using a composition of NAND’s.


Solution: 
Recall that (3.1) states that

$$\mathrm{MAJ}(x_0,x_1,x_2) = \mathrm{OR}(\mathrm{AND}(x_0,x_1),\, \mathrm{OR}(\mathrm{AND}(x_1,x_2),\, \mathrm{AND}(x_0,x_2))) . \qquad (3.2)$$

We can use Theorem 3.10 to replace all the occurrences of AND
and OR with NAND’s. Specifically, we can use the equivalence
AND(a, b) = NOT(NAND(a, b)), OR(a, b) = NAND(NOT(a), NOT(b)),
and NOT(a) = NAND(a, a) to replace the right-hand side of
(3.2) with an expression involving only NAND, yielding that
MAJ(a, b, c) is equivalent to the (somewhat unwieldy) expression


NAND( NAND( NAND(NAND(a, b), NAND(a, c)),
            NAND(NAND(a, b), NAND(a, c)) ),
      NAND(b, c) )


The same formula can also be expressed as a circuit with NAND 
gates, see Fig. 3.27. 
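
The NAND-only expression for majority can be verified by brute force over all eight inputs. The sketch below transcribes it into Python (reusing the repeated subexpression instead of evaluating it twice); the function name `maj_nand` is ours.

```python
from itertools import product

def NAND(a, b):
    return 1 - a * b

def maj_nand(a, b, c):
    # NAND(NAND(a,b), NAND(a,c)) equals OR(AND(a,b), AND(a,c)).
    ab_or_ac = NAND(NAND(a, b), NAND(a, c))
    # NAND(t, t) negates t, so the outer NAND computes OR(t, AND(b, c)).
    return NAND(NAND(ab_or_ac, ab_or_ac), NAND(b, c))

maj_ok = all(maj_nand(a, b, c) == int(a + b + c >= 2)
             for a, b, c in product((0, 1), repeat=3))
print(maj_ok)
```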


3.6.1 NAND Circuits 

We define NAND Circuits as circuits in which all the gates are NAND
operations. Such a circuit again corresponds to a directed acyclic
graph (DAG). Since all the gates correspond to the same function (i.e.,
NAND), we do not even need to label them, and all gates have in-degree
exactly two. Despite their simplicity, NAND circuits can be
quite powerful.

Figure 3.27: A circuit with NAND gates to compute
the Majority function on three bits.

Example 3.11 — NAND circuit for XOR. Recall the XOR function
which maps $x_0, x_1 \in \{0,1\}$ to $x_0 + x_1 \bmod 2$. We have seen in
Section 3.2.2 that we can compute XOR using AND, OR, and NOT,
and so by Theorem 3.10 we can compute it using only NAND’s.
However, the following is a direct construction of computing XOR
by a sequence of NAND operations:

1. Let $u = \mathrm{NAND}(x_0, x_1)$.

2. Let $v = \mathrm{NAND}(x_0, u)$.

3. Let $w = \mathrm{NAND}(x_1, u)$.

4. The XOR of $x_0$ and $x_1$ is $y_0 = \mathrm{NAND}(v, w)$.

One can verify that this algorithm does indeed compute XOR
by enumerating all the four choices for $x_0, x_1 \in \{0,1\}$. We can also
represent this algorithm graphically as a circuit, see Fig. 3.28.
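
The enumeration of the four cases can be delegated to a short Python sketch. It follows the four steps of the example literally; only the function name `xor_nand` is introduced here.

```python
def NAND(a, b):
    return 1 - a * b

def xor_nand(x0, x1):
    u = NAND(x0, x1)
    v = NAND(x0, u)
    w = NAND(x1, u)
    return NAND(v, w)

xor_table = {(x0, x1): xor_nand(x0, x1) for x0 in (0, 1) for x1 in (0, 1)}
print(xor_table)
```

The construction uses four NAND operations, compared with the nine one would get by mechanically expanding an AND/OR/NOT formula for XOR such as AND(OR(a, b), NOT(AND(a, b))).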


In fact, we can show the following theorem: 


Theorem 3.12 — NAND is a universal operation. For every Boolean circuit
$C$ of $s$ gates, there exists a NAND circuit $C'$ of at most $3s$ gates that
computes the same function as $C$.

Proof Idea:
The idea of the proof is to just replace every AND, OR and NOT
gate with their NAND implementation following the proof of Theorem 3.10.
*


Proof of Theorem 3.12. If $C$ is a Boolean circuit, then since, as we’ve
seen in the proof of Theorem 3.10, for every $a,b \in \{0,1\}$


• NOT(a) = NAND(a, a)
• AND(a, b) = NAND(NAND(a, b), NAND(a, b))
• OR(a, b) = NAND(NAND(a, a), NAND(b, b))


we can replace every gate of $C$ with at most three NAND gates to
obtain an equivalent circuit $C'$. The resulting circuit will have at most
$3s$ gates.

Figure 3.28: A circuit with NAND gates to compute
the XOR of two bits.

3.6.2 More examples of NAND circuits (optional)

Here are some more sophisticated examples of NAND circuits:


Incrementing integers. Consider the task of computing, given as input
a string $x \in \{0,1\}^n$ that represents a natural number $X \in \mathbb{N}$, the
representation of $X + 1$. That is, we want to compute the function
$INC_n : \{0,1\}^n \to \{0,1\}^{n+1}$ such that for every $x_0,\ldots,x_{n-1}$, $INC_n(x) = y$
which satisfies $\sum_{i=0}^{n} y_i 2^i = \left(\sum_{i=0}^{n-1} x_i 2^i\right) + 1$. (For simplicity of
notation, in this example we use the representation where the least
significant digit is first rather than last.)

The increment operation can be very informally described as fol- 
lows: “Add 1 to the least significant bit and propagate the carry”. A little 
more precisely, in the case of the binary representation, to obtain the 
increment of x, we scan x from the least significant bit onwards, and 
flip all 1’s to 0’s until we encounter a bit equal to 0, in which case we 
flip it to 1 and stop. 

Thus we can compute the increment of $x_0,\ldots,x_{n-1}$ by doing the
following:


Algorithm 3.13 describes precisely how to compute the increment
operation, and can be easily transformed into Python code that performs
the same computation, but it does not seem to directly yield
a NAND circuit to compute this. However, we can transform this
algorithm line by line to a NAND circuit. For example, since for every
$a$, NAND(a, NOT(a)) = 1, we can replace the initial statement
$c_0 = 1$ with $c_0 = \mathrm{NAND}(x_0, \mathrm{NAND}(x_0, x_0))$. We already know
how to compute XOR using NAND and so we can use this to implement
the operation $y_i \leftarrow \mathrm{XOR}(x_i, c_i)$. Similarly, we can write
the “if” statement as saying $c_{i+1} \leftarrow \mathrm{AND}(c_i, x_i)$, or in other words
$c_{i+1} \leftarrow \mathrm{NAND}(\mathrm{NAND}(c_i, x_i), \mathrm{NAND}(c_i, x_i))$. Finally, the assignment
$y_n = c_n$ can be written as $y_n = \mathrm{NAND}(\mathrm{NAND}(c_n, c_n), \mathrm{NAND}(c_n, c_n))$.
Combining these observations yields for every $n \in \mathbb{N}$, a NAND circuit
to compute $INC_n$. For example, Fig. 3.29 shows what this circuit looks
like for $n = 4$.
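
The line-by-line translation described above can be sketched in Python as follows. Every Boolean operation in `inc` is ultimately a call to `NAND`; the loop is just a compact way of writing out the straight-line sequence of gates for a given n. The helpers `to_bits` and `from_bits` are our own and exist only to test the result; the input is assumed to have at least one bit.

```python
def NAND(a, b):
    return 1 - a * b

def XOR(a, b):
    u = NAND(a, b)
    return NAND(NAND(a, u), NAND(b, u))

def AND(a, b):
    t = NAND(a, b)
    return NAND(t, t)

def inc(x):
    """x: list of n >= 1 bits, least significant first. Returns n+1 bits."""
    c = NAND(x[0], NAND(x[0], x[0]))          # c_0 = 1
    y = []
    for xi in x:
        y.append(XOR(xi, c))                  # y_i = XOR(x_i, c_i)
        c = AND(c, xi)                        # c_{i+1} = AND(c_i, x_i)
    y.append(NAND(NAND(c, c), NAND(c, c)))    # y_n = c_n
    return y

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

print(inc(to_bits(7, 4)))   # 7 + 1 = 8, least significant bit first
```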


From increment to addition. Once we have the increment operation,
we can certainly compute addition by repeatedly incrementing (i.e.,
compute $x+y$ by performing INC(x) $y$ times). However, that would be
quite inefficient and unnecessary. With the same idea of keeping track
of carries we can implement the “grade-school” addition algorithm
and compute the function $ADD_n : \{0,1\}^{2n} \to \{0,1\}^{n+1}$ that on
input $x \in \{0,1\}^{2n}$ outputs the binary representation of the sum of the
numbers represented by $x_0,\ldots,x_{n-1}$ and $x_n,\ldots,x_{2n-1}$:


Once again, Algorithm 3.14 can be translated into a NAND circuit.
The crucial observation is that the “if/then” statement simply
corresponds to $c_{i+1} \leftarrow MAJ_3(u_i, v_i, c_i)$ and we have seen in Solved
Exercise 3.5 that the function $MAJ_3 : \{0,1\}^3 \to \{0,1\}$ can be computed
using NANDs.
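
A minimal Python sketch of grade-school addition built only from NAND is given below. It is written under two assumptions of ours, since the algorithm itself is not reproduced here: the two summands are called u (the first n bits of the input) and v (the last n bits), and the output bit at position i is the XOR of u_i, v_i, and the carry c_i. The carry update is the majority of the same three bits, as in the observation above.

```python
def NAND(a, b):
    return 1 - a * b

def XOR(a, b):
    u = NAND(a, b)
    return NAND(NAND(a, u), NAND(b, u))

def MAJ(a, b, c):
    t = NAND(NAND(a, b), NAND(a, c))
    return NAND(NAND(t, t), NAND(b, c))

def add(x):
    """x: 2n bits (n >= 1); u = x[:n] and v = x[n:], least significant first."""
    n = len(x) // 2
    u, v = x[:n], x[n:]
    one = NAND(x[0], NAND(x[0], x[0]))
    c = NAND(one, one)                      # c_0 = 0
    y = []
    for ui, vi in zip(u, v):
        y.append(XOR(XOR(ui, vi), c))       # sum bit (our assumption, see above)
        c = MAJ(ui, vi, c)                  # carry bit
    y.append(c)                             # y_n = c_n
    return y

def to_bits(value, n):
    return [(value >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

print(from_bits(add(to_bits(5, 3) + to_bits(6, 3))))
```

Each position costs a constant number of NAND operations, so the total is linear in n, in contrast with the repeated-increment approach.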


3.6.3 The NAND-CIRC Programming language 

Just like we did for Boolean circuits, we can define a programming- 
language analog of NAND circuits. It is even simpler than the AON- 
CIRC language since we only have a single operation. We define the 
NAND-CIRC Programming Language to be a programming language 
where every line (apart from the input/output declaration) has the 
following form: 


foo = NAND(bar, blah)

where foo, bar and blah are variable identifiers.

Figure 3.29: NAND circuit computing the increment
function on 4 bits.


Example 3.15 — Our first NAND-CIRC program. Here is an example of a
NAND-CIRC program:


u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)


Formally, just like we did in Definition 3.8 for AON-CIRC, we can 
define the notion of computation by a NAND-CIRC program in the 
natural way: 


Definition 3.16 — Computing by a NAND-CIRC program. Let $f : \{0,1\}^n \to
\{0,1\}^m$ be some function, and let $P$ be a NAND-CIRC program.
We say that $P$ computes the function $f$ if:


1. $P$ has $n$ input variables X[0], ..., X[n-1] and $m$ output variables
Y[0], ..., Y[m-1].


2. For every $x \in \{0,1\}^n$, if we execute $P$ when we assign to
X[0], ..., X[n-1] the values $x_0,\ldots,x_{n-1}$, then at the end of
the execution, the output variables Y[0], ..., Y[m-1] have the
values $y_0,\ldots,y_{m-1}$ where $y = f(x)$.
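
To make the definition concrete, here is a hedged sketch of a tiny evaluator for NAND-CIRC source text, together with a brute-force check of the "computes" condition. The names `eval_nand_circ` and `computes` and the regular expression are our own; the sketch assumes each line has exactly the form `foo = NAND(bar,blah)` and performs no other validation.

```python
import re

# One line of NAND-CIRC: target = NAND(left,right)
LINE = re.compile(r"^\s*([\w\[\]]+)\s*=\s*NAND\(\s*([\w\[\]]+)\s*,\s*([\w\[\]]+)\s*\)\s*$")

def eval_nand_circ(program, x, m):
    """Run the program on input bits x and return the m output bits."""
    env = {f"X[{i}]": bit for i, bit in enumerate(x)}
    for line in program.splitlines():
        if not line.strip():
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"not a NAND-CIRC line: {line!r}")
        target, left, right = match.groups()
        env[target] = 1 - env[left] * env[right]
    return [env[f"Y[{j}]"] for j in range(m)]

XOR_PROGRAM = """
u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)
"""

def computes(program, f, n, m):
    """Check the condition of the definition by trying all 2^n inputs."""
    for value in range(2 ** n):
        x = [(value >> i) & 1 for i in range(n)]
        if eval_nand_circ(program, x, m) != f(x):
            return False
    return True

print(computes(XOR_PROGRAM, lambda x: [x[0] ^ x[1]], 2, 1))
```

Here the Python function passed as `f` plays the role of the specification, while the program text is the implementation being checked against it.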


As before we can show that NAND circuits are equivalent to 
NAND-CIRC programs (see Fig. 3.30): 


Theorem 3.17 — NAND circuits and straight-line program equivalence. For
every $f : \{0,1\}^n \to \{0,1\}^m$ and $s \geq m$, $f$ is computable by a
NAND-CIRC program of $s$ lines if and only if $f$ is computable by a
NAND circuit of $s$ gates.

Figure 3.30: A NAND program and the corresponding
circuit. Note how every line in the program corresponds
to a gate in the circuit.

We omit the proof of Theorem 3.17 since it follows along exactly
the same lines as the equivalence of Boolean circuits and AON-CIRC
programs (Theorem 3.9). Given Theorem 3.17 and Theorem 3.12, we
know that we can translate every $s$-line AON-CIRC program $P$ into
an equivalent NAND-CIRC program of at most $3s$ lines. In fact, this
translation can be easily done by replacing every line of the form
foo = AND(bar,blah), foo = OR(bar,blah) or foo = NOT(bar)
with the equivalent 1-3 lines that use the NAND operation. Our GitHub
repository contains a “proof by code”: a simple Python program
AON2NAND that transforms an AON-CIRC into an equivalent NAND-CIRC
program.
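
The following is an independent, minimal sketch of such a line-by-line translation. It is not the AON2NAND program from the repository, whose code is not shown here. The function names, the regular expression, and the temporary-variable scheme `tmp_k` are all our own assumptions, and the sketch assumes the input program does not already use variables named `tmp_k`.

```python
import re

AON_LINE = re.compile(
    r"^\s*([\w\[\]]+)\s*=\s*(AND|OR|NOT)\(\s*([\w\[\]]+)\s*(?:,\s*([\w\[\]]+)\s*)?\)\s*$"
)

def aon_to_nand(program):
    """Translate AON-CIRC source text into NAND-CIRC source text."""
    out, fresh = [], 0
    for line in program.strip().splitlines():
        if not line.strip():
            continue
        match = AON_LINE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        target, op, a, b = match.groups()
        if (op == "NOT") != (b is None):
            raise ValueError(f"wrong number of arguments: {line!r}")
        if op == "NOT":                      # one NAND line
            out.append(f"{target} = NAND({a},{a})")
        elif op == "AND":                    # two NAND lines
            t = f"tmp_{fresh}"
            fresh += 1
            out += [f"{t} = NAND({a},{b})", f"{target} = NAND({t},{t})"]
        else:                                # OR: three NAND lines
            ta, tb = f"tmp_{fresh}", f"tmp_{fresh + 1}"
            fresh += 2
            out += [f"{ta} = NAND({a},{a})", f"{tb} = NAND({b},{b})",
                    f"{target} = NAND({ta},{tb})"]
    return "\n".join(out)

def run_nand(source, x):
    """Evaluate translated NAND-CIRC text (in the exact format emitted above)."""
    env = {f"X[{i}]": bit for i, bit in enumerate(x)}
    for line in source.splitlines():
        target, rhs = line.split(" = ")
        left, right = rhs[len("NAND("):-1].split(",")
        env[target] = 1 - env[left] * env[right]
    return env["Y[0]"]

MAJ_AON = """
and1 = AND(X[0],X[1])
and2 = AND(X[1],X[2])
and3 = AND(X[0],X[2])
or1 = OR(and1,and2)
Y[0] = OR(or1,and3)
"""

MAJ_NAND = aon_to_nand(MAJ_AON)
print(MAJ_NAND)
```

On this 5-line majority program the translation emits 12 NAND lines, within the bound of 3s = 15.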


3.7 EQUIVALENCE OF ALL THESE MODELS 


If we put together Theorem 3.9, Theorem 3.12, and Theorem 3.17, we 
obtain the following result: 


Theorem 3.19 — Equivalence between models of finite computation. For
every sufficiently large $s,n,m$ and $f : \{0,1\}^n \to \{0,1\}^m$, the
following conditions are all equivalent to one another:


• $f$ can be computed by a Boolean circuit (with $\wedge$, $\vee$, $\neg$ gates) of at
most $O(s)$ gates.


• $f$ can be computed by an AON-CIRC straight-line program of at
most $O(s)$ lines.


• $f$ can be computed by a NAND circuit of at most $O(s)$ gates.


• $f$ can be computed by a NAND-CIRC straight-line program of at
most $O(s)$ lines.


By “$O(s)$” we mean that the bound is at most $c \cdot s$ where $c$ is a constant
that is independent of $n$. For example, if $f$ can be computed by a
Boolean circuit of $s$ gates, then it can be computed by a NAND-CIRC
program of at most $3s$ lines, and if $f$ can be computed by a NAND
circuit of $s$ gates, then it can be computed by an AON-CIRC program
of at most $2s$ lines.


Proof Idea: 

We omit the formal proof, which is obtained by combining Theo- 
rem 3.9, Theorem 3.12, and Theorem 3.17. The key observation is that 
the results we have seen allow us to translate a program/circuit that 
computes f in one of the above models into a program/circuit that 
computes f in another model by increasing the lines/gates by at most 
a constant factor (in fact this constant factor is at most 3). 

* 


Theorem 3.9 is a special case of a more general result. We can con- 
sider even more general models of computation, where instead of 
AND/OR/NOT or NAND, we use other operations (see Section 3.7.1 
below). It turns out that Boolean circuits are equivalent in power to 
such models as well. The fact that all these different ways to define 
computation lead to equivalent models shows that we are “on the 
right track”. It justifies the seemingly arbitrary choices that we’ve 
made of using AND/OR/NOT or NAND as our basic operations, 
since these choices do not affect the power of our computational 
model. Equivalence results such as Theorem 3.19 mean that we can 
easily translate between Boolean circuits, NAND circuits, NAND- 
CIRC programs and the like. We will use this ability later on in this 
book, often shifting to the most convenient formulation without mak- 
ing a big deal about it. Hence we will not worry too much about the 
distinction between, for example, Boolean circuits and NAND-CIRC 
programs. 

In contrast, we will continue to take special care to distinguish 
between circuits/programs and functions (recall Big Idea 2). A func- 
tion corresponds to a specification of a computational task, and it is 
a fundamentally different object than a program or a circuit, which 
corresponds to the implementation of the task. 


3.7.1 Circuits with other gate sets 

There is nothing special about AND/OR/NOT or NAND. For every
set of functions $G = \{G_0,\ldots,G_{k-1}\}$, we can define a notion of circuits
that use elements of $G$ as gates, and a notion of a “$G$ programming
language” where every line involves assigning to a variable foo the result
of applying some $G_i \in G$ to previously defined or input variables.
Specifically, we can make the following definition:


Definition 3.20 — General straight-line programs. Let $F = \{f_0,\ldots,f_{t-1}\}$
be a finite collection of Boolean functions, such that $f_i : \{0,1\}^{k_i} \to
\{0,1\}$ for some $k_i \in \mathbb{N}$. An $F$ program is a sequence of lines, each of
which assigns to some variable the result of applying some $f_i \in F$
to $k_i$ other variables. As above, we use X[i] and Y[j] to denote the
input and output variables.

We say that $F$ is a universal set of operations (also known as a universal
gate set) if there exists an $F$ program to compute the function
NAND.


AON-CIRC programs correspond to {AND, OR, NOT} programs, and
NAND-CIRC programs correspond to F programs for the set
F that only contains the NAND function, but we can also define
{IF, ZERO, ONE} programs (see below), or use any other set.

We can also define F circuits, which will be directed graphs in 
which each gate corresponds to applying a function f; € F, and will 
each have k; incoming wires and a single outgoing wire. (If the func- 
tion f; is not symmetric, in the sense that the order of its input matters 
then we need to label each wire entering a gate as to which parameter 
of the function it corresponds to.) As in Theorem 3.9, we can show 
that F circuits and F programs are equivalent. We have seen that for 
F = {AND, OR, NOT}, the resulting circuits/programs are equivalent
in power to the NAND-CIRC programming language, as we can com- 
pute NAND using AND/OR/NOT and vice versa. This turns out to be 
a special case of a general phenomenon — the universality of NAND 
and other gate sets — that we will explore more in-depth later in this 
book. 


Example 3.21 — IF, ZERO, ONE circuits. Let $F = \{IF, ZERO, ONE\}$
where $ZERO : \{0,1\} \to \{0\}$ and $ONE : \{0,1\} \to \{1\}$ are the
constant zero and one functions,³ and $IF : \{0,1\}^3 \to \{0,1\}$ is the
function that on input $(a,b,c)$ outputs $b$ if $a = 1$ and $c$ otherwise.
Then $F$ is universal.

Indeed, we can demonstrate that {IF, ZERO, ONE} is universal 
using the following formula for NAND: 


NAND(a, b) = IF(a, IF(b, ZERO, ONE), ONE) . 
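
The formula can be checked over the four possible inputs with a short Python sketch. Here ZERO and ONE are modeled as one-argument functions that ignore their input, matching the definition in the example; which bit is passed to them is irrelevant and is our arbitrary choice.

```python
def IF(a, b, c):
    return b if a == 1 else c

def ZERO(a):
    return 0

def ONE(a):
    return 1

def nand_from_if(a, b):
    # NAND(a, b) = IF(a, IF(b, ZERO, ONE), ONE)
    return IF(a, IF(b, ZERO(b), ONE(b)), ONE(a))

table = {(a, b): nand_from_if(a, b) for a in (0, 1) for b in (0, 1)}
print(table)
```

The inner IF(b, ZERO, ONE) is simply NOT(b), so the formula reads: if a = 1 output NOT(b), otherwise output 1.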


There are also some sets F that are more restricted in power. For
example it can be shown that if we use only AND or OR gates (without
NOT) then we do not get an equivalent model of computation.
The exercises cover several examples of universal and non-universal
gate sets.

3 One can also define these functions as taking a
length zero input. This makes no difference for the
computational power of the model.


3.7.2 Specification vs. implementation (again) 
As we discussed in Section 2.6.1, one of the most important distinctions
in this book is that of specification versus implementation or separating
“what” from “how” (see Fig. 3.31). A function corresponds
to the specification of a computational task, that is what output should
be produced for every particular input. A program (or circuit, or any 
other way to specify algorithms) corresponds to the implementation of 
how to compute the desired output from the input. That is, a program 
is a set of instructions on how to compute the output from the input. 
Even within the same computational model there can be many differ- 
ent ways to compute the same function. For example, there is more 
than one NAND-CIRC program that computes the majority function, 
more than one Boolean circuit to compute the addition function, and 
so on and so forth. 
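
The distinction can be illustrated with a small Python sketch: the specification of majority on three bits is written as a truth table (a plain dictionary), and two different NAND-only implementations are checked against it. Both implementations and all names are our own illustrative choices.

```python
from itertools import product

def NAND(a, b):
    return 1 - a * b

# "What": the function itself, given as a table of input/output pairs.
MAJ_SPEC = {x: int(sum(x) >= 2) for x in product((0, 1), repeat=3)}

# "How", version A: OR(OR(AND(a,b), AND(a,c)), AND(b,c)) in NAND form.
def maj_impl_a(a, b, c):
    t = NAND(NAND(a, b), NAND(a, c))
    return NAND(NAND(t, t), NAND(b, c))

# "How", version B: OR(AND(a, OR(b,c)), AND(b,c)) in NAND form.
def maj_impl_b(a, b, c):
    b_or_c = NAND(NAND(b, b), NAND(c, c))
    return NAND(NAND(a, b_or_c), NAND(b, c))

same_function = all(maj_impl_a(*x) == MAJ_SPEC[x] == maj_impl_b(*x)
                    for x in MAJ_SPEC)
print(same_function)
```

The two implementations are different sequences of gates, yet they compute the same function: one specification, several implementations.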

Confusing specification and implementation (or equivalently functions
and programs) is a common mistake, and one that is unfortunately
encouraged by the common programming-language terminology
of referring to parts of programs as “functions”. However, in
both the theory and practice of computer science, it is important to
maintain this distinction, and it is particularly important for us in this
book.


Figure 3.31: It is crucial to distinguish between the
specification of a computational task, namely what is
the function that is to be computed and the implementation
of it, namely the algorithm, program, or circuit
that contains the instructions defining how to map
an input to an output. The same function could be
computed in many different ways.


• We can use NAND to compute many other functions,
including majority, increment, and others.

• There are other equivalent choices, including the
sets {AND, OR, NOT} and {IF, ZERO, ONE}.

• We can formally define the notion of a function
$F : \{0,1\}^n \to \{0,1\}^m$ being computable using the
NAND-CIRC Programming language.

• For every set of basic operations, the notions of being
computable by a circuit and being computable
by a straight-line program are equivalent.


3.8 EXERCISES 


Exercise 3.1 — Compare 4 bit numbers. Give a Boolean circuit
(with AND/OR/NOT gates) that computes the function
$CMP_8 : \{0,1\}^8 \to \{0,1\}$ such that $CMP_8(a_0,a_1,a_2,a_3,b_0,b_1,b_2,b_3) = 1$
if and only if the number represented by $a_0a_1a_2a_3$ is larger than the
number represented by $b_0b_1b_2b_3$.


Exercise 3.2 — Compare n bit numbers. Prove that there exists a constant $c$
such that for every $n$ there is a Boolean circuit (with AND/OR/NOT
gates) $C$ of at most $c \cdot n$ gates that computes the function $CMP_{2n} :
\{0,1\}^{2n} \to \{0,1\}$ such that $CMP_{2n}(a_0 \cdots a_{n-1} b_0 \cdots b_{n-1}) = 1$ if and
only if the number represented by $a_0 \cdots a_{n-1}$ is larger than the number
represented by $b_0 \cdots b_{n-1}$.


Exercise 3.3 — OR,NOT is universal. Prove that the set {OR, NOT} is universal,
in the sense that one can compute NAND using these gates.


Exercise 3.4 — AND,OR is not universal. Prove that for every $n$-bit input
circuit $C$ that contains only AND and OR gates, as well as gates that
compute the constant functions 0 and 1, $C$ is monotone, in the sense
that if $x, x' \in \{0,1\}^n$, $x_i \leq x'_i$ for every $i \in [n]$, then $C(x) \leq C(x')$.
Conclude that the set {AND, OR, 0, 1} is not universal.


Exercise 3.5 — XOR is not universal. Prove that for every $n$-bit input circuit
$C$ that contains only XOR gates, as well as gates that compute the
constant functions 0 and 1, $C$ is affine or linear modulo two, in the sense
that there exists some $a \in \{0,1\}^n$ and $b \in \{0,1\}$ such that for every
$x \in \{0,1\}^n$, $C(x) = \sum_{i=0}^{n-1} a_i x_i + b \bmod 2$.

Conclude that the set {XOR, 0, 1} is not universal.


Exercise 3.6 — MAJ,NOT,1 is universal. Let $MAJ : \{0,1\}^3 \to \{0,1\}$ be the
majority function. Prove that {MAJ, NOT, 1} is a universal set of gates.


Exercise 3.7 — MAJ,NOT is not universal. Prove that {MAJ, NOT} is not a
universal set. See footnote for hint.⁴


Exercise 3.8 — NOR is universal. Let $NOR : \{0,1\}^2 \to \{0,1\}$ be defined as
NOR(a, b) = NOT(OR(a, b)). Prove that {NOR} is a universal set of
gates.


Exercise 3.9 — Lookup is universal. Prove that $\{LOOKUP_1, 0, 1\}$ is a universal
set of gates where 0 and 1 are the constant functions and
$LOOKUP_1 : \{0,1\}^3 \to \{0,1\}$ satisfies that $LOOKUP_1(a,b,c)$ equals $a$ if
$c = 0$ and equals $b$ if $c = 1$.

Exercise 3.10 — Bound on universal basis size (challenge). Prove that for every
subset $B$ of the functions from $\{0,1\}^k$ to $\{0,1\}$, if $B$ is universal
then there is a $B$-circuit of at most $O(1)$ gates to compute the NAND
function (you can start by showing that there is a $B$ circuit of at most
O(k1°) gates).° 


Exercise 3.11 — Size and inputs / outputs. Prove that for every NAND circuit
of size $s$ with $n$ inputs and $m$ outputs, $s \geq \min\{n/2, m\}$. See
footnote for hint.⁶


Exercise 3.12 — Threshold using NANDs. Prove that there is some constant
$c$ such that for every $n > 1$, and integers $a_0,\ldots,a_{n-1},b \in \{-2^n, -2^n + 1,
\ldots, -1, 0, +1, \ldots, 2^n\}$, there is a NAND circuit with at most $c n^4$ gates
that computes the threshold function $f_{a_0,\ldots,a_{n-1},b} : \{0,1\}^n \to \{0,1\}$ that
on input $x \in \{0,1\}^n$ outputs 1 if and only if $\sum_{i=0}^{n-1} a_i x_i > b$.


Exercise 3.13 — NANDs from activation functions. We say that a function
$f : \mathbb{R}^2 \to \mathbb{R}$ is a NAND approximator if it has the following property: for
every $a,b \in \mathbb{R}$, if $\min\{|a|, |1-a|\} \leq 1/3$ and $\min\{|b|, |1-b|\} \leq 0.1$ then
$|f(a,b) - \mathrm{NAND}(\lfloor a \rceil, \lfloor b \rceil)| \leq 0.1$ where we denote by $\lfloor x \rceil$ the integer
closest to $x$. That is, if $a, b$ are within a distance 1/3 to $\{0,1\}$ then we
want $f(a,b)$ to equal the NAND of the values in $\{0,1\}$ that are closest
to $a$ and $b$ respectively. Otherwise, we do not care what the output of
$f$ is on $a$ and $b$.

In this exercise you will show that you can construct a NAND approximator
from many common activation functions used in deep
neural networks. As a corollary you will obtain that deep neural networks
can simulate NAND circuits. Since NAND circuits can also
simulate deep neural networks, these two computational models are
equivalent to one another.


1. Show that there is a NAND approximator $f$ defined as $f(a,b) =
L(\mathrm{DReLU}(L'(a,b)))$ where $L' : \mathbb{R}^2 \to \mathbb{R}$ is an affine function (of the
form $L'(a,b) = \alpha a + \beta b + \gamma$ for some $\alpha, \beta, \gamma \in \mathbb{R}$), $L$ is an affine
function (of the form $L(y) = \alpha y + \beta$ for $\alpha, \beta \in \mathbb{R}$), and $\mathrm{DReLU} :
\mathbb{R} \to \mathbb{R}$ is the function defined as $\mathrm{DReLU}(x) = \min(1, \max(0, x))$.
Note that $\mathrm{DReLU}(x) = 1 - \mathrm{ReLU}(1 - \mathrm{ReLU}(x))$ where $\mathrm{ReLU}(x) =
\max(x, 0)$ is the rectified linear unit activation function.


2. Show that there is a NAND approximator $f$ defined as $f(a,b) =
L(\mathrm{sigmoid}(L'(a,b)))$ where $L', L$ are affine as above and $\mathrm{sigmoid} :
\mathbb{R} \to \mathbb{R}$ is the function defined as $\mathrm{sigmoid}(x) = e^x/(e^x + 1)$.


3. Show that there is a NAND approximator $f$ defined as $f(a,b) =
L(\tanh(L'(a,b)))$ where $L', L$ are affine as above and $\tanh : \mathbb{R} \to \mathbb{R}$
is the function defined as $\tanh(x) = (e^x - e^{-x})/(e^x + e^{-x})$.


4. Prove that for every NAND-circuit $C$ with $n$ inputs and one output
that computes a function $g : \{0,1\}^n \to \{0,1\}$, if we replace every
gate of $C$ with a NAND-approximator and then invoke the resulting
circuit on some $x \in \{0,1\}^n$, the output will be a number $y$ such
that $|y - g(x)| \leq 1/3$.


Exercise 3.14 — Majority with NANDs efficiently. Prove that there is some
constant $c$ such that for every $n > 1$, there is a NAND circuit of at
most $c \cdot n$ gates that computes the majority function on $n$ input bits
$MAJ_n : \{0,1\}^n \to \{0,1\}$. That is $MAJ_n(x) = 1$ iff $\sum_{i=0}^{n-1} x_i > n/2$. See
footnote for hint.⁷


Exercise 3.15 — Output at last layer. Prove that for every $f : \{0,1\}^n \to
\{0,1\}$, if there is a Boolean circuit $C$ of $s$ gates that computes $f$ then
there is a Boolean circuit $C'$ of at most $s$ gates such that in the minimal
layering of $C'$, the output gate of $C'$ is placed in the last layer. See
footnote for hint.⁸


4 Hint: Use the fact that $MAJ(\overline{a}, \overline{b}, \overline{c}) = \overline{MAJ(a,b,c)}$
to prove that every $f : \{0,1\}^n \to \{0,1\}$ computable
by a circuit with only MAJ and NOT gates satisfies
$f(0,0,\ldots,0) \neq f(1,1,\ldots,1)$. Thanks to Nathan
Brunelle and David Evans for suggesting this exercise.

5 Thanks to Alec Sun and Simon Fischer for comments
on this problem.

6 Hint: Use the conditions of Definition 3.5 stipulating
that every input vertex has at least one out-neighbor
and there are exactly $m$ output gates. See also Remark 3.7.

7 One approach to solve this is using recursion and
analyzing it using the so called “Master Theorem”.

8 Hint: Vertices in layers beyond the output can be
safely removed without changing the functionality of
the circuit.


3.9 BIOGRAPHICAL NOTES 


The excerpt from Al-Khwarizmi’s book is from “The Algebra of Ben-Musa”,
Fredric Rosen, 1831.

Charles Babbage (1791-1871) was a visionary scientist, mathematician,
and inventor (see [Swa02; CM00]). More than a century before
the invention of modern electronic computers, Babbage realized that 
computation can be in principle mechanized. His first design for a 
mechanical computer was the difference engine that was designed to do 
polynomial interpolation. He then designed the analytical engine which 
was a much more general machine and the first prototype for a pro- 
grammable general-purpose computer. Unfortunately, Babbage was 
never able to complete the design of his prototypes. One of the earliest 
people to realize the engine’s potential and far-reaching implications 
was Ada Lovelace (see the notes for Chapter 7). 

Boolean algebra was first investigated by Boole and DeMorgan 
in the 1840’s [Boo47; De 47]. The definition of Boolean circuits and 
connection to electrical relay circuits was given in Shannon’s Masters 
Thesis [Sha38]. (Howard Gardener called Shannon’s thesis “possibly 
the most important, and also the most famous, master’s thesis of the 
[20th] century”.) Savage’s book [Sav98], like this one, introduces 
the theory of computation starting with Boolean circuits as the first 
model. Jukna’s book [Juk12] contains a modern in-depth exposition of 
Boolean circuits, see also [Weg87].

The NAND function was shown to be universal by Sheffer [She13], 
though this also appears in the earlier work of Peirce, see [Bur78]. 
Whitehead and Russell used NAND as the basis for their logic in 
their magnum opus Principia Mathematica [WR12]. In her Ph.D thesis, 
Ernst [Ern09] investigates empirically the minimal NAND circuits 
for various functions. Nisan and Schocken’s book [NS05] builds a
computing system starting from NAND gates and ending with high- 
level programs and games (“NAND to Tetris”); see also the website 
nandtotetris.org. 

We defined the size of a Boolean circuit in Definition 3.5 to be the 
number of gates it contains. This is one of two conventions used in the 
literature. The other convention is to define the size as the number of 
wires (equivalent to the number of gates plus the number of inputs). 
This makes very little difference in almost all settings, but can affect 
the circuit size complexity of some “pathological examples” of func- 
tions such as the constant zero function that do not depend on much 
of their inputs. 


Learning Objectives:


• Get comfortable with syntactic sugar or
automatic translation of higher-level logic to
low-level gates.

• Learn proof of major result: every finite
function can be computed by a Boolean
circuit.

• Start thinking quantitatively about the
number of lines required for computation.

4 
Syntactic sugar, and computing every function 


“[In 1951] I had a running compiler and nobody would touch it because,
they carefully told me, computers could only do arithmetic; they could not do 
programs.”, Grace Murray Hopper, 1986. 


“Syntactic sugar causes cancer of the semicolon.”, Alan Perlis, 1982. 


The computational models we considered thus far are as “bare 
bones” as they come. For example, our NAND-CIRC “programming 
language” has only the single operation foo = NAND(bar,blah). In 
this chapter we will see that these simple models are actually equiv- 
alent to more sophisticated ones. The key observation is that we can 
implement more complex features using our basic building blocks, 
and then use these new features themselves as building blocks for 
even more sophisticated features. This is known as “syntactic sugar” 
in the field of programming language design since we are not modi- 
fying the underlying programming model itself, but rather we merely 
implement new features by syntactically transforming a program that 
uses such features into one that doesn’t. 

This chapter provides a “toolkit” that can be used to show that many functions can be computed by NAND-CIRC programs, and hence also by Boolean circuits. We will also use this toolkit to prove a fundamental theorem: every finite function $f : \{0,1\}^n \to \{0,1\}^m$ can be computed by a Boolean circuit, see Theorem 4.13 below. While the syntactic sugar toolkit is important in its own right, Theorem 4.13 can also be proven directly without using this toolkit. We present this alternative proof in Section 4.5. See Fig. 4.1 for an outline of the results of this chapter.


Compiled on 12.19.2022 22:58 




[Figure 4.1 diagram: syntactic sugar (functions/macros, conditionals: if/then, bounded loops) → compute the LOOKUP function → compute every finite function.]


Figure 4.1: An outline of the results of this chapter. In 
Section 4.1 we give a toolkit of “syntactic sugar” trans- 
formations showing how to implement features such 
as programmer-defined functions and conditional 
statements in NAND-CIRC. We use these tools in 
Section 4.3 to give a NAND-CIRC program (or alter- 
natively a Boolean circuit) to compute the LOOKUP 
function. We then build on this result to show in Sec- 
tion 4.4 that NAND-CIRC programs (or equivalently, 
Boolean circuits) can compute every finite function. 
An alternative direct proof of the same result is given 
in Section 4.5. 




4.1 SOME EXAMPLES OF SYNTACTIC SUGAR 


We now present some examples of “syntactic sugar” transformations that we can use in constructing straight-line programs or circuits. We focus on the straight-line programming language view of our computational models, and specifically (for the sake of concreteness) on the NAND-CIRC programming language. This is convenient because 
many of the syntactic sugar transformations we present are easiest to 
think about in terms of applying “search and replace” operations to 
the source code of a program. However, by Theorem 3.19, all of our 
results hold equally well for circuits, whether ones using NAND gates 
or Boolean circuits that use the AND, OR, and NOT operations. Enu- 
merating the examples of such syntactic sugar transformations can be 
a little tedious, but we do it for two reasons: 


1. To convince you that despite their seeming simplicity and limita- 
tions, simple models such as Boolean circuits or the NAND-CIRC 
programming language are actually quite powerful. 


2. So you can realize how lucky you are to be taking a theory of com- 
putation course and not a compilers course... : ) 


4.1.1 User-defined procedures 

One staple of almost any programming language is the ability to 
define and then execute procedures or subroutines. (These are often 
known as functions in some programming languages, but we prefer 
the name procedures to avoid confusion with the function that a pro- 
gram computes.) The NAND-CIRC programming language does 
not have this mechanism built in. However, we can achieve the same 
effect using the time-honored technique of “copy and paste”. Specifi- 
cally, we can replace code which defines a procedure such as 


    def Proc(a,b):
        proc_code
        return c
    some_code
    f = Proc(d,e)
    some_more_code

with the following code where we “paste” the code of Proc

    some_code
    proc_code'
    some_more_code


and where proc_code' is obtained by replacing all occurrences 
of a with d, b with e, and c with f. When doing that we will need to 
ensure that all other variables appearing in proc_code' don’t interfere 
with other variables. We can always do so by renaming variables to 
new names that were not used before. The above reasoning leads to 
the proof of the following theorem: 


Theorem 4.1 — Procedure definition syntactic sugar. Let NAND-CIRC-PROC be the programming language NAND-CIRC augmented with the syntax above for defining procedures. Then for every NAND-CIRC-PROC program P, there exists a standard (i.e., “sugar-free”) NAND-CIRC program P’ that computes the same function as P.


Theorem 4.1 can be proven using the transformation above, but since the formal proof is somewhat long and tedious, we omit it here.


Example 4.3 — Computing Majority from NAND using syntactic sugar. Procedures allow us to express NAND-CIRC programs much more cleanly and succinctly. For example, because we can compute AND, OR, and NOT using NANDs, we can compute the Majority function as follows:


    def NOT(a):
        return NAND(a,a)

    def AND(a,b):
        temp = NAND(a,b)
        return NOT(temp)

    def OR(a,b):
        temp1 = NOT(a)
        temp2 = NOT(b)
        return NAND(temp1,temp2)

    def MAJ(a,b,c):
        and1 = AND(a,b)
        and2 = AND(a,c)
        and3 = AND(b,c)
        or1 = OR(and1,and2)
        return OR(or1,and3)

    print(MAJ(0,1,1))
    # 1


Fig. 4.2 presents the “sugar-free” NAND-CIRC program (and 
the corresponding circuit) that is obtained by “expanding out” this 
program, replacing the calls to procedures with their definitions. 


Figure 4.2: A standard (i.e., “sugar-free”) NAND-CIRC program that is obtained by expanding out the procedure definitions in the program for Majority of Example 4.3. The corresponding circuit is on the right. Note that this is not the most efficient NAND circuit/program for majority: we can save on some gates by “shortcutting” steps where a gate u computes NAND(v, v) and then a gate w computes NAND(u, u) (as indicated by the dashed green arrows in the above figure).

    temp = NAND(X[0],X[1])
    and1 = NAND(temp,temp)
    temp = NAND(X[0],X[2])
    and2 = NAND(temp,temp)
    temp = NAND(X[1],X[2])
    and3 = NAND(temp,temp)
    temp1 = NAND(and1,and1)
    temp2 = NAND(and2,and2)
    or1 = NAND(temp1,temp2)
    temp1 = NAND(or1,or1)
    temp2 = NAND(and3,and3)
    Y[0] = NAND(temp1,temp2)

4.1.2 Proof by Python (optional) 

We can write a Python program that implements the proof of Theo- 
rem 4.1. This is a Python program that takes a NAND-CIRC-PROC 
program P that includes procedure definitions and uses simple 
“search and replace” to transform P into a standard (i.e., “sugar- 
free”) NAND-CIRC program P’ that computes the same function as 
P without using any procedures. The idea is simple: if the program P 
contains a definition of a procedure Proc of two arguments x and y, 
then whenever we see a line of the form foo = Proc(bar,blah), we 
can replace this line by: 


1. The body of the procedure Proc (replacing all occurrences of x and 
y with bar and blah respectively). 


2. A line foo = exp, where exp is the expression following the return statement in the definition of the procedure Proc.


To make this more robust we add a prefix to the internal variables 
used by Proc to ensure they don’t conflict with the variables of P; 
for simplicity we ignore this issue in the code below though it can be 
easily added. 

The code of the Python function desugar below achieves such a 
transformation. 

Fig. 4.2 shows the result of applying desugar to the program of Example 4.3 that uses syntactic sugar to compute the Majority function. Specifically, we first apply desugar to remove usage of the OR function, then apply it to remove usage of the AND function, and finally apply it a third time to remove usage of the NOT function.

Figure 4.3: Python code for transforming NAND-CIRC-PROC programs into standard sugar-free NAND-CIRC programs. 


    import re

    def desugar(code, func_name, func_args, func_body):
        """
        Replaces all occurrences of
           foo = func_name(func_args)
        with
           func_body[x->a, y->b]
           foo = [result returned in func_body]
        """
        # Uses Python regular expressions to simplify the search and replace,
        # see https://docs.python.org/3/library/re.html and Chapter 9 of the book

        # regular expression for capturing a list of variable names separated by commas
        arglist = ",".join([r"([a-zA-Z0-9\_\[\]]+)" for i in range(len(func_args))])
        # regular expression for capturing a statement of the form
        # "variable = func_name(arguments)"
        regexp = fr'([a-zA-Z0-9\_\[\]]+)\s*=\s*{func_name}\({arglist}\)\s*$'
        while True:
            m = re.search(regexp, code, re.MULTILINE)
            if not m: break
            newcode = func_body
            # replace function arguments by the variables from the function invocation
            for i in range(len(func_args)):
                newcode = newcode.replace(func_args[i], m.group(i+2))
            # Splice the new code inside
            newcode = newcode.replace('return', m.group(1) + " = ")
            code = code[:m.start()] + newcode + code[m.end()+1:]
        return code

4.1.3 Conditional statements 

Another sorely missing feature in NAND-CIRC is a conditional statement such as the if/then constructs that are found in many programming languages. However, using procedures, we can obtain an ersatz if/then construct. First we can compute the function $\mathrm{IF} : \{0,1\}^3 \to \{0,1\}$ such that $\mathrm{IF}(a,b,c)$ equals $b$ if $a = 1$ and $c$ if $a = 0$.

The IF function can be implemented from NANDs as follows (see Exercise 4.2):


    def IF(cond,a,b):
        notcond = NAND(cond,cond)
        temp = NAND(b,notcond)
        temp1 = NAND(a,cond)
        return NAND(temp,temp1)


The IF function is also known as a multiplexing function, since cond 
can be thought of as a switch that controls whether the output is con- 
nected to a or b. Once we have a procedure for computing the IF func- 
tion, we can implement conditionals in NAND. The idea is that we 
replace code of the form 


    if (condition): assign blah to variable foo

with code of the form

    foo = IF(condition, blah, foo)

that assigns to foo its old value when condition equals 0, and assigns to foo the value of blah otherwise. More generally we can replace code of the form 


    if (cond):
        a = ...
        b = ...
        c = ...

with code of the form

    temp_a = ...
    temp_b = ...
    temp_c = ...
    a = IF(cond,temp_a,a)
    b = IF(cond,temp_b,b)
    c = IF(cond,temp_c,c)


Using such transformations, we can prove the following theorem. 
Once again we omit the (not too insightful) full formal proof, though 
see Section 4.1.2 for some hints on how to obtain it. 


Theorem 4.6 — Conditional statements syntactic sugar. Let NAND-CIRC- 
IF be the programming language NAND-CIRC augmented with 
if/then/else statements for allowing code to be conditionally 
executed based on whether a variable is equal to 0 or 1. 

Then for every NAND-CIRC-IF program P, there exists a standard (i.e., “sugar-free”) NAND-CIRC program P’ that computes the same function as P.
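As a quick sanity check of the multiplexer and of the rewriting rule foo = IF(condition, blah, foo), the following Python sketch models NAND on the integers 0 and 1 and compares IF against Python's own conditional on all eight inputs. The helper name conditional_assign is ours and is not part of NAND-CIRC; it only illustrates the sugar-free form of a one-line conditional.

```python
from itertools import product

def NAND(a, b):
    # NAND on bits represented as the Python integers 0 and 1
    return 1 - a * b

def IF(cond, a, b):
    # The four NAND lines from the text: returns a if cond == 1, else b
    notcond = NAND(cond, cond)
    temp = NAND(b, notcond)
    temp1 = NAND(a, cond)
    return NAND(temp, temp1)

def conditional_assign(condition, blah, foo):
    # Hypothetical helper: the sugar-free form of
    #   if (condition): foo = blah
    # It returns the new value of foo.
    return IF(condition, blah, foo)

# Exhaustively compare IF with Python's conditional expression
for cond, a, b in product([0, 1], repeat=3):
    assert IF(cond, a, b) == (a if cond == 1 else b)
```

Because the inputs are single bits, the exhaustive loop over eight cases is a complete check of this particular implementation.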


4.2 EXTENDED EXAMPLE: ADDITION AND MULTIPLICATION (OPTIONAL) 


Using “syntactic sugar”, we can write the integer addition function as 
follows: 


    # Add two n-bit integers
    # Use LSB first notation for simplicity
    def ADD(A,B):
        Result = [0]*(n+1)
        Carry  = [0]*(n+1)
        Carry[0] = zero(A[0])
        for i in range(n):
            Result[i] = XOR(Carry[i],XOR(A[i],B[i]))
            Carry[i+1] = MAJ(Carry[i],A[i],B[i])
        Result[n] = Carry[n]
        return Result

    ADD([1,1,1,0,0],[1,0,0,0,0])
    # [0, 0, 0, 1, 0, 0]


where zero is the constant zero function, and MAJ and XOR correspond to the majority and XOR functions respectively. While we use Python syntax for convenience, in this example n is some fixed integer and so for every such n, ADD is a finite function that takes as input 2n bits and outputs n + 1 bits. In particular for every n we can remove the loop construct for i in range(n) by simply repeating the code n times, replacing the value of i with 0, 1, 2, ..., n − 1. By expanding out all the features, for every value of n we can translate the above program into a standard (“sugar-free”) NAND-CIRC program. Fig. 4.4 depicts what we get for n = 2.
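To experiment with the addition program, here is a minimal runnable sketch in which every operation bottoms out in NAND. Two things are our assumptions rather than the text's: n is read off the input length instead of being fixed in advance, and to_bits is a hypothetical helper for the LSB-first representation. The XOR shown is one common four-NAND construction.

```python
from itertools import product

def NAND(a, b):
    # bits are modeled as the Python integers 0 and 1
    return 1 - a * b

def NOT(a):
    return NAND(a, a)

def AND(a, b):
    return NOT(NAND(a, b))

def OR(a, b):
    return NAND(NOT(a), NOT(b))

def XOR(a, b):
    # a common four-NAND construction of XOR
    u = NAND(a, b)
    return NAND(NAND(a, u), NAND(b, u))

def MAJ(a, b, c):
    return OR(OR(AND(a, b), AND(a, c)), AND(b, c))

def zero(a):
    # constant zero function: a AND (NOT a)
    return AND(a, NOT(a))

def ADD(A, B):
    # A and B are lists of n bits, least significant bit first
    n = len(A)  # assumption: n is taken from the input length
    Result = [0] * (n + 1)
    Carry = [0] * (n + 1)
    Carry[0] = zero(A[0])
    for i in range(n):
        Result[i] = XOR(Carry[i], XOR(A[i], B[i]))
        Carry[i + 1] = MAJ(Carry[i], A[i], B[i])
    Result[n] = Carry[n]
    return Result

def to_bits(value, width):
    # hypothetical helper: LSB-first binary representation of value
    return [(value >> j) & 1 for j in range(width)]

# Compare against Python integer addition on every pair of 3-bit numbers
n = 3
for x, y in product(range(2 ** n), repeat=2):
    assert ADD(to_bits(x, n), to_bits(y, n)) == to_bits(x + y, n + 1)
```

The loop body uses a constant number of NAND operations per bit position, which is the source of the linear line count discussed next.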




By going through the above program carefully and accounting for 
the number of gates, we can see that it yields a proof of the following 
theorem (see also Fig. 4.5): 


Theorem 4.7 — Addition using NAND-CIRC programs. For every $n \in \mathbb{N}$, let $\mathrm{ADD}_n : \{0,1\}^{2n} \to \{0,1\}^{n+1}$ be the function that, given $x, x' \in \{0,1\}^n$ computes the representation of the sum of the numbers that $x$ and $x'$ represent. Then there is a constant $c \le 30$ such that for every $n$ there is a NAND-CIRC program of at most $cn$ lines computing $\mathrm{ADD}_n$.¹


Once we have addition, we can use the grade-school algorithm to 
obtain multiplication as well, thus obtaining the following theorem: 


Theorem 4.8 — Multiplication using NAND-CIRC programs. For every $n$, let $\mathrm{MULT}_n : \{0,1\}^{2n} \to \{0,1\}^{2n}$ be the function that, given $x, x' \in \{0,1\}^n$ computes the representation of the product of the numbers that $x$ and $x'$ represent. Then there is a constant $c$ such that for every $n$, there is a NAND-CIRC program of at most $cn^2$ lines that computes the function $\mathrm{MULT}_n$.


We omit the proof, though in Exercise 4.7 we ask you to supply a “constructive proof” in the form of a program (in your favorite programming language) that on input a number $n$, outputs the code of a NAND-CIRC program of at most $1000n^2$ lines that computes the $\mathrm{MULT}_n$ function. In fact, we can use Karatsuba’s algorithm to show that there is a NAND-CIRC program of $O(n^{\log_2 3})$ lines to compute $\mathrm{MULT}_n$ (and can get even further asymptotic improvements using better algorithms).

Figure 4.4: The NAND-CIRC program and corresponding NAND circuit for adding two-digit binary numbers that are obtained by “expanding out” all the syntactic sugar. The program/circuit has 43 lines/gates which is by no means necessary. It is possible to add $n$ bit numbers using $9n$ NAND gates, see Exercise 4.5.

¹ The value of c can be improved to 9, see Exercise 4.5.

Figure 4.5: The number of lines in our NAND-CIRC program to add two $n$ bit numbers, as a function of $n$, for $n$’s between 1 and 100. This is not the most efficient program for this task, but the important point is that it has the form $O(n)$.


4.3 THE LOOKUP FUNCTION 


The LOOKUP function will play an important role in this chapter and 
later. It is defined as follows: 


Definition 4.9 — Lookup function. For every $k$, the lookup function of order $k$, $\mathrm{LOOKUP}_k : \{0,1\}^{2^k+k} \to \{0,1\}$ is defined as follows: For every $x \in \{0,1\}^{2^k}$ and $i \in \{0,1\}^k$,

$$\mathrm{LOOKUP}_k(x, i) = x_i$$

where $x_i$ denotes the $i$-th entry of $x$, using the binary representation to identify $i$ with a number in $\{0, \ldots, 2^k - 1\}$.
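Before building a circuit for it, it may help to see the function itself written directly in Python. This sketch is only a specification, not a NAND-CIRC program; the name lookup and the list-based representation are our assumptions, and the index is read most significant bit first, matching the convention used in the rest of this section.

```python
def lookup(x, i):
    # x: list of 2**k bits (the table); i: list of k bits (the index),
    # with i[0] the most significant bit
    k = len(i)
    assert len(x) == 2 ** k
    position = 0
    for bit in i:
        position = 2 * position + bit
    return x[position]

# A table of length 16 whose only 1 is at position 6, and the index 0110
table = [0] * 16
table[6] = 1
assert lookup(table, [0, 1, 1, 0]) == 1
```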


See Fig. 4.6 for an illustration of the LOOKUP function. It turns out that for every $k$, we can compute $\mathrm{LOOKUP}_k$ using a NAND-CIRC program:

Theorem 4.10 — Lookup function. For every $k > 0$, there is a NAND-CIRC program that computes the function $\mathrm{LOOKUP}_k : \{0,1\}^{2^k+k} \to \{0,1\}$. Moreover, the number of lines in this program is at most $4 \cdot 2^k$.

An immediate corollary of Theorem 4.10 is that for every $k > 0$, $\mathrm{LOOKUP}_k$ can be computed by a Boolean circuit (with AND, OR and NOT gates) of at most $8 \cdot 2^k$ gates.


4.3.1 Constructing a NAND-CIRC program for LOOKUP 
We prove Theorem 4.10 by induction. For the case $k = 1$, $\mathrm{LOOKUP}_1$ maps $(x_0, x_1, i) \in \{0,1\}^3$ to $x_i$. In other words, if $i = 0$ then it outputs $x_0$ and otherwise it outputs $x_1$, which (up to reordering variables) is the same as the IF function presented in Section 4.1.3, which can be computed by a 4-line NAND-CIRC program.

Figure 4.6: The $\mathrm{LOOKUP}_k$ function takes an input in $\{0,1\}^{2^k+k}$, which we denote by $x, i$ (with $x \in \{0,1\}^{2^k}$ and $i \in \{0,1\}^k$). The output is $x_i$: the $i$-th coordinate of $x$, where we identify $i$ as a number in $[2^k]$ using the binary representation. In the above example $x \in \{0,1\}^{16}$ and $i \in \{0,1\}^4$. Since $i = 0110$ is the binary representation of the number 6, the output of $\mathrm{LOOKUP}_4(x, i)$ in this case is $x_6 = 1$.

As a warm-up for the case of general $k$, let us consider the case of $k = 2$. Given input $x = (x_0, x_1, x_2, x_3)$ for $\mathrm{LOOKUP}_2$ and an index $i = (i_0, i_1)$, if the most significant bit $i_0$ of the index is 0 then $\mathrm{LOOKUP}_2(x, i)$ will equal $x_0$ if $i_1 = 0$ and equal $x_1$ if $i_1 = 1$. Similarly, if the most significant bit $i_0$ is 1 then $\mathrm{LOOKUP}_2(x, i)$ will equal $x_2$ if $i_1 = 0$ and will equal $x_3$ if $i_1 = 1$. Another way to say this is that we can write $\mathrm{LOOKUP}_2$ as follows:

    def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
        if i[0]==1:
            return LOOKUP1(X[2],X[3],i[1])
        else:
            return LOOKUP1(X[0],X[1],i[1])

or in other words,

    def LOOKUP2(X[0],X[1],X[2],X[3],i[0],i[1]):
        a = LOOKUP1(X[2],X[3],i[1])
        b = LOOKUP1(X[0],X[1],i[1])
        return IF(i[0],a,b)


More generally, as shown in the following lemma, we can compute $\mathrm{LOOKUP}_k$ using two invocations of $\mathrm{LOOKUP}_{k-1}$ and one invocation of IF:

Lemma 4.11 — Lookup recursion. For every $k \ge 2$, $\mathrm{LOOKUP}_k(x_0, \ldots, x_{2^k-1}, i_0, \ldots, i_{k-1})$ is equal to

$$\mathrm{IF}\left(i_0,\ \mathrm{LOOKUP}_{k-1}(x_{2^{k-1}}, \ldots, x_{2^k-1}, i_1, \ldots, i_{k-1}),\ \mathrm{LOOKUP}_{k-1}(x_0, \ldots, x_{2^{k-1}-1}, i_1, \ldots, i_{k-1})\right)$$

Proof. If the most significant bit $i_0$ of $i$ is zero, then the index $i$ is in $\{0, \ldots, 2^{k-1} - 1\}$ and hence we can perform the lookup on the “first half” of $x$ and the result of $\mathrm{LOOKUP}_k(x, i)$ will be the same as $a = \mathrm{LOOKUP}_{k-1}(x_0, \ldots, x_{2^{k-1}-1}, i_1, \ldots, i_{k-1})$. On the other hand, if this most significant bit $i_0$ is equal to 1, then the index is in $\{2^{k-1}, \ldots, 2^k - 1\}$, in which case the result of $\mathrm{LOOKUP}_k(x, i)$ is the same as $b = \mathrm{LOOKUP}_{k-1}(x_{2^{k-1}}, \ldots, x_{2^k-1}, i_1, \ldots, i_{k-1})$. Thus we can compute $\mathrm{LOOKUP}_k(x, i)$ by first computing $a$ and $b$ and then outputting $\mathrm{IF}(i_0, b, a)$.

∎


Proof of Theorem 4.10 from Lemma 4.11. Now that we have Lemma 4.11, we can complete the proof of Theorem 4.10. We will prove by induction on $k$ that there is a NAND-CIRC program of at most $4 \cdot (2^k - 1)$ lines for $\mathrm{LOOKUP}_k$. For $k = 1$ this follows by the four line program for IF we’ve seen before. For $k > 1$, we use the following pseudocode:

    a = LOOKUP_(k-1)(X[0],...,X[2^(k-1)-1],i[1],...,i[k-1])
    b = LOOKUP_(k-1)(X[2^(k-1)],...,X[2^k-1],i[1],...,i[k-1])
    return IF(i[0],b,a)

If we let $L(k)$ be the number of lines required for $\mathrm{LOOKUP}_k$, then the above pseudo-code shows that

$$L(k) \le 2L(k-1) + 4. \qquad (4.1)$$

Since under our induction hypothesis $L(k-1) \le 4(2^{k-1} - 1)$, we get that $L(k) \le 2 \cdot 4(2^{k-1} - 1) + 4 = 4(2^k - 1)$ which is what we wanted to prove. See Fig. 4.7 for a plot of the actual number of lines in our implementation of $\mathrm{LOOKUP}_k$.
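The recursion and the line count can be checked mechanically. The sketch below generates a sugar-free program for LOOKUP_k as a list of NAND lines, runs it, and confirms that this particular construction uses exactly 4(2^k − 1) lines. The names build_lookup and run, the temporary variable names t0, t1, ..., and the tuple encoding of a line are our own conventions, not the book's; the index occupies the last k inputs, most significant bit first.

```python
from itertools import product

def build_lookup(k):
    # Each program line is a tuple (target, left, right),
    # meaning: target = NAND(left, right)
    lines = []

    def nand(a, b):
        target = f"t{len(lines)}"   # fresh temporary variable name
        lines.append((target, a, b))
        return target

    def IF(cond, a, b):
        # the four-line IF procedure, inlined at every call
        notcond = nand(cond, cond)
        temp = nand(b, notcond)
        temp1 = nand(a, cond)
        return nand(temp, temp1)

    def rec(table, index):
        if len(index) == 1:
            # LOOKUP_1(x0, x1, i) = IF(i, x1, x0)
            return IF(index[0], table[1], table[0])
        half = len(table) // 2
        a = rec(table[:half], index[1:])   # lookup in the first half
        b = rec(table[half:], index[1:])   # lookup in the second half
        return IF(index[0], b, a)

    table = [f"X[{j}]" for j in range(2 ** k)]
    index = [f"X[{2 ** k + j}]" for j in range(k)]
    return lines, rec(table, index)

def run(lines, out, inputs):
    # Evaluate the straight-line program on a tuple of input bits
    env = {f"X[{j}]": bit for j, bit in enumerate(inputs)}
    for target, a, b in lines:
        env[target] = 1 - env[a] * env[b]
    return env[out]

for k in (1, 2, 3):
    lines, out = build_lookup(k)
    assert len(lines) == 4 * (2 ** k - 1)
    for inputs in product([0, 1], repeat=2 ** k + k):
        x, i = inputs[:2 ** k], inputs[2 ** k:]
        position = int("".join(map(str, i)), 2)
        assert run(lines, out, inputs) == x[position]
```

The exhaustive check stops at k = 3 only because the number of inputs, 2^(2^k + k), grows very quickly; the line count can be checked for larger k without running the program.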


4.4 COMPUTING EVERY FUNCTION 


At this point we know the following facts about NAND-CIRC programs (and so equivalently about Boolean circuits and our other equivalent models):

1. They can compute at least some non-trivial functions.

2. Coming up with NAND-CIRC programs for various functions is a very tedious task.

Thus I would not blame the reader if they were not particularly looking forward to a long sequence of examples of functions that can be computed by NAND-CIRC programs. However, it turns out we are not going to need this, as we can show in one fell swoop that NAND-CIRC programs can compute every finite function:

Figure 4.7: The number of lines in our implementation of the LOOKUP_k function as a function of $k$ (i.e., the length of the index). The number of lines in our implementation is roughly $3 \cdot 2^k$.


Theorem 4.12 — Universality of NAND. There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \to \{0,1\}^m$, there is a NAND-CIRC program with at most $c \cdot m 2^n$ lines that computes the function $f$.

By Theorem 3.19, the models of NAND circuits, NAND-CIRC programs, AON-CIRC programs, and Boolean circuits, are all equivalent to one another, and hence Theorem 4.12 holds for all these models. In particular, the following theorem is equivalent to Theorem 4.12:

Theorem 4.13 — Universality of Boolean circuits. There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \to \{0,1\}^m$, there is a Boolean circuit with at most $c \cdot m 2^n$ gates that computes the function $f$.

Improved bounds. Though it will not be of great importance to us, it is possible to improve on the proof of Theorem 4.12 and shave an extra factor of $n$, as well as optimize the constant $c$, and so prove that for every $\epsilon > 0$, $m \in \mathbb{N}$ and sufficiently large $n$, if $f : \{0,1\}^n \to \{0,1\}^m$ then $f$ can be computed by a NAND circuit of at most $(1 + \epsilon)\frac{m \cdot 2^n}{n}$ gates. The proof of this result is beyond the scope of this book, but we do discuss how to obtain a bound of the form $O\left(\frac{m \cdot 2^n}{n}\right)$ in Section 4.4.2; see also the biographical notes.


4.4.1 Proof of NAND’s Universality 

To prove Theorem 4.12, we need to give a NAND circuit, or equivalently a NAND-CIRC program, for every possible function. We will restrict our attention to the case of Boolean functions (i.e., $m = 1$). Exercise 4.9 asks you to extend the proof for all values of $m$. A function $F : \{0,1\}^n \to \{0,1\}$ can be specified by a table of its values for each one of the $2^n$ inputs. For example, the table below describes one particular function $G : \{0,1\}^4 \to \{0,1\}$:²

² In case you are curious, this is the function on input $i \in \{0,1\}^4$ (which we interpret as a number in $[16]$), that outputs the $i$-th digit of $\pi$ in the binary basis.

Table 4.1: An example of a function $G : \{0,1\}^4 \to \{0,1\}$.

    Input (x)    Output (G(x))
    0000         1
    0001         1
    0010         0
    0011         0
    0100         1
    0101         0
    0110         0
    0111         1
    1000         0
    1001         0
    1010         0
    1011         0
    1100         1
    1101         1
    1110         1
    1111         1

For every $x \in \{0,1\}^4$, $G(x) = \mathrm{LOOKUP}_4(1100100100001111, x)$, and so the following is NAND-CIRC “pseudocode” to compute $G$ using syntactic sugar for the LOOKUP_4 procedure.

    G0000 = 1
    G1000 = 1
    G0100 = 0
    ...
    G0111 = 1
    G1111 = 1
    Y[0] = LOOKUP_4(G0000,G1000,...,G1111,
                    X[0],X[1],X[2],X[3])

We can translate this pseudocode into an actual NAND-CIRC program by adding three lines to define variables zero and one that are initialized to 0 and 1 respectively, and then replacing a statement such as Gxxx = 0 with Gxxx = NAND(one,one) and a statement such as Gxxx = 1 with Gxxx = NAND(zero,zero). The call to LOOKUP_4 will be replaced by the NAND-CIRC program that computes $\mathrm{LOOKUP}_4$, plugging in the appropriate inputs.
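The example of G can be replayed in a few lines of Python. The sketch below evaluates G by feeding its truth table to a NAND-only LOOKUP built from the recursion of Lemma 4.11, and also checks the footnote's claim that the table consists of the first 16 binary digits of π. The helper names compute_via_lookup and line_count are ours; line_count simply adds up the 3 + 2^n initialization lines and the at most 4 · 2^n lines for the lookup.

```python
import math
from itertools import product

def NAND(a, b):
    return 1 - a * b

def IF(cond, a, b):
    # returns a if cond == 1, else b
    notcond = NAND(cond, cond)
    return NAND(NAND(b, notcond), NAND(a, cond))

def LOOKUP(table, index):
    # Recursion of Lemma 4.11; index[0] is the most significant bit
    if len(index) == 1:
        return IF(index[0], table[1], table[0])
    half = len(table) // 2
    a = LOOKUP(table[:half], index[1:])   # first half of the table
    b = LOOKUP(table[half:], index[1:])   # second half of the table
    return IF(index[0], b, a)

def compute_via_lookup(truth_table, x):
    # Hypothetical helper: F(x) = LOOKUP_n(truth table of F, x)
    return LOOKUP(truth_table, list(x))

def line_count(n):
    # 3 + 2^n lines of initialization plus at most 4 * 2^n lines for LOOKUP_n
    return 3 + 2 ** n + 4 * 2 ** n

G_TABLE = [int(c) for c in "1100100100001111"]

# G agrees with Table 4.1 on all 16 inputs
for x in product([0, 1], repeat=4):
    position = int("".join(map(str, x)), 2)
    assert compute_via_lookup(G_TABLE, x) == G_TABLE[position]

# The table is the first 16 binary digits of pi (2 integer bits, 14 fractional)
assert format(math.floor(math.pi * 2 ** 14), "016b") == "1100100100001111"
```

Nothing in compute_via_lookup depends on G: any list of 2^n bits can be passed as the truth table, which is the point of the general argument.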

There was nothing about the above reasoning that was particular to the function $G$ above. Given every function $F : \{0,1\}^n \to \{0,1\}$, we can write a NAND-CIRC program that does the following:

1. Initialize $2^n$ variables of the form F00...0 till F11...1 so that for every $z \in \{0,1\}^n$, the variable corresponding to $z$ is assigned the value $F(z)$.

2. Compute $\mathrm{LOOKUP}_n$ on the $2^n$ variables initialized in the previous step, with the index variable being the input variables X[0],...,X[n−1]. That is, just like in the pseudocode for G above, we use Y[0] = LOOKUP(F00..00,...,F11..1,X[0],..,X[n-1])

The total number of lines in the resulting program is $3 + 2^n$ lines for initializing the variables plus the $4 \cdot 2^n$ lines that we pay for computing $\mathrm{LOOKUP}_n$. This completes the proof of Theorem 4.12.

4.4.2 Improving by a factor of n (optional) 

By being a little more careful, we can improve the bound of Theorem 4.12 and show that every function $F : \{0,1\}^n \to \{0,1\}^m$ can be computed by a NAND-CIRC program of at most $O(m 2^n / n)$ lines. In other words, we can prove the following improved version:

Theorem 4.15 — Universality of NAND circuits, improved bound. There exists a constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \to \{0,1\}^m$, there is a NAND-CIRC program with at most $c \cdot m 2^n / n$ lines that computes the function $f$.³


Proof. As before, it is enough to prove the case that $m = 1$. Hence we let $f : \{0,1\}^n \to \{0,1\}$, and our goal is to prove that there exists a NAND-CIRC program of $O(2^n / n)$ lines (or equivalently a Boolean circuit of $O(2^n / n)$ gates) that computes $f$.

We let $k = \log(n - 2\log n)$ (the reasoning behind this choice will become clear later on). We define the function $g : \{0,1\}^k \to \{0,1\}^{2^{n-k}}$ as follows:

$$g(a) = f(a0^{n-k})\, f(a0^{n-k-1}1) \cdots f(a1^{n-k}).$$

In other words, if we use the usual binary representation to identify the numbers $\{0, \ldots, 2^{n-k} - 1\}$ with the strings $\{0,1\}^{n-k}$, then for every $a \in \{0,1\}^k$ and $b \in \{0,1\}^{n-k}$

$$g(a)_b = f(ab). \qquad (4.2)$$


(4.2) means that for every $x \in \{0,1\}^n$, if we write $x = ab$ with $a \in \{0,1\}^k$ and $b \in \{0,1\}^{n-k}$ then we can compute $f(x)$ by first computing the string $T = g(a)$ of length $2^{n-k}$, and then computing $\mathrm{LOOKUP}_{n-k}(T, b)$ to retrieve the element of $T$ at the position corresponding to $b$ (see Fig. 4.8). The cost to compute the $\mathrm{LOOKUP}_{n-k}$ is $O(2^{n-k})$ lines/gates and the cost in NAND-CIRC lines (or Boolean gates) to compute $f$ is at most

$$\mathrm{cost}(g) + O(2^{n-k}), \qquad (4.3)$$

where $\mathrm{cost}(g)$ is the number of operations (i.e., lines of NAND-CIRC programs or gates in a circuit) needed to compute $g$.

³ The constant c in this theorem is at most 10 and in fact can be arbitrarily close to 1, see Section 4.8.

Figure 4.8: We can compute $f : \{0,1\}^n \to \{0,1\}$ on input $x = ab$ where $a \in \{0,1\}^k$ and $b \in \{0,1\}^{n-k}$ by first computing the $2^{n-k}$ long string $g(a)$ that corresponds to all $f$’s values on inputs that begin with $a$, and then outputting the $b$-th coordinate of this string.

To complete the proof we need to give a bound on $\mathrm{cost}(g)$. Since $g$ is a function mapping $\{0,1\}^k$ to $\{0,1\}^{2^{n-k}}$, we can also think of it as a collection of $2^{n-k}$ functions $g_0, \ldots, g_{2^{n-k}-1} : \{0,1\}^k \to \{0,1\}$, where $g_i(a) = g(a)_i$ for every $a \in \{0,1\}^k$ and $i \in [2^{n-k}]$. (That is, $g_i(a)$ is the $i$-th bit of $g(a)$.) Naively, we could use Theorem 4.12 to compute each $g_i$ in $O(2^k)$ lines, but then the total cost is $O(2^{n-k} \cdot 2^k) = O(2^n)$ which does not save us anything. However, the crucial observation is that there are only $2^{2^k}$ distinct functions mapping $\{0,1\}^k$ to $\{0,1\}$. For example, if $g_{17}$ is an identical function to $g_{67}$ that means that if we already computed $g_{17}(a)$ then we can compute $g_{67}(a)$ using only a constant number of operations: simply copy the same value! In general, if you have a collection of $N$ functions $g_0, \ldots, g_{N-1}$ mapping $\{0,1\}^k$ to $\{0,1\}$, of which at most $S$ are distinct then for every value $a \in \{0,1\}^k$ we can compute the $N$ values $g_0(a), \ldots, g_{N-1}(a)$ using at most $O(S \cdot 2^k + N)$ operations (see Fig. 4.9).

In our case, because there are at most $2^{2^k}$ distinct functions mapping $\{0,1\}^k$ to $\{0,1\}$, we can compute the function $g$ (and hence by (4.2) also $f$) using at most

$$O(2^{2^k} \cdot 2^k + 2^{n-k}) \qquad (4.4)$$

operations. Now all that is left is to plug into (4.4) our choice of $k = \log(n - 2\log n)$. By definition, $2^k = n - 2\log n$, which means that (4.4) can be bounded

$$O\left(2^{n - 2\log n} \cdot (n - 2\log n) + 2^{n - \log(n - 2\log n)}\right) \le O\left(\frac{2^n}{n^2} \cdot n + \frac{2^n}{n - 2\log n}\right) \le O\left(\frac{2^n}{n} + \frac{2^n}{0.5n}\right) = O\left(\frac{2^n}{n}\right)$$

which is what we wanted to prove. (We used above the fact that $n - 2\log n \ge 0.5n$ for sufficiently large $n$.)

∎

Figure 4.9: If $g_0, \ldots, g_{N-1}$ is a collection of functions each mapping $\{0,1\}^k$ to $\{0,1\}$ such that at most $S$ of them are distinct then for every $a \in \{0,1\}^k$, we can compute all the values $g_0(a), \ldots, g_{N-1}(a)$ using at most $O(S \cdot 2^k + N)$ operations by first computing the distinct functions and then copying the resulting values.
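The counting step of this proof can be seen numerically. The sketch below takes a random truth table for f with n = 16, so that k = log(n − 2 log n) = 3 exactly, splits each input as ab, groups the 2^(n−k) functions g_b by their truth tables, and compares the number of distinct ones with the 2^(2^k) ceiling. The function name group_columns, the random choice of f and the fixed seed are our assumptions, and the final estimate drops all constant factors.

```python
import math
import random

def group_columns(f_table, n, k):
    # f_table[x] is f(x), where x in range(2**n) is read as the string ab:
    # a = the top k bits, b = the bottom n - k bits.
    groups = {}
    for b in range(2 ** (n - k)):
        # truth table of g_b : {0,1}^k -> {0,1}, i.e. the map a -> f(ab)
        column = tuple(f_table[(a << (n - k)) | b] for a in range(2 ** k))
        groups.setdefault(column, []).append(b)
    return groups

n = 16
k = int(math.log2(n - 2 * math.log2(n)))   # log(16 - 8) = 3
random.seed(0)                              # arbitrary seed, for reproducibility only
f_table = [random.randint(0, 1) for _ in range(2 ** n)]

groups = group_columns(f_table, n, k)
S = len(groups)              # number of distinct functions among the g_b
N = 2 ** (n - k)             # total number of functions g_b
assert S <= 2 ** (2 ** k)    # at most 2^(2^k) = 256 functions on k bits
estimate = S * 2 ** k + N    # the S * 2^k + N count from the text, constants dropped
assert estimate < 2 ** n     # compare with the naive 2^(n-k) * 2^k = 2^n
```

Each distinct truth table is computed once, at a cost of about 2^k operations, and every other g_b with the same table is a copy, which is where the N term comes from.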


Using the connection between NAND-CIRC programs and Boolean circuits, an immediate corollary of Theorem 4.15 is the following improvement to Theorem 4.13:

Theorem 4.16 — Universality of Boolean circuits, improved bound. There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \to \{0,1\}^m$, there is a Boolean circuit with at most $c \cdot m 2^n / n$ gates that computes the function $f$.


4.5 COMPUTING EVERY FUNCTION: AN ALTERNATIVE PROOF 


Theorem 4.13 is a fundamental result in the theory (and practice!) of 
computation. In this section, we present an alternative proof of this 
basic fact that Boolean circuits can compute every finite function. This 
alternative proof gives a somewhat worse quantitative bound on the 
number of gates but it has the advantage of being simpler, working 
directly with circuits and avoiding the usage of all the syntactic sugar 
machinery. (However, that machinery is useful in its own right, and 
will find other applications later on.) 


Theorem 4.17 — Universality of Boolean circuits (alternative phrasing). There exists some constant $c > 0$ such that for every $n, m > 0$ and function $f : \{0,1\}^n \to \{0,1\}^m$, there is a Boolean circuit with at most $c \cdot m \cdot n 2^n$ gates that computes the function $f$.


Proof Idea:

The idea of the proof is illustrated in Fig. 4.10. As before, it is enough to focus on the case that $m = 1$ (the function $f$ has a single output), since we can always extend this to the case of $m > 1$ by looking at the composition of $m$ circuits each computing a different output bit of the function $f$. We start by showing that for every $\alpha \in \{0,1\}^n$, there is an $O(n)$-sized circuit that computes the function $\delta_\alpha : \{0,1\}^n \to \{0,1\}$ defined as follows: $\delta_\alpha(x) = 1$ iff $x = \alpha$ (that is, $\delta_\alpha$ outputs 0 on all inputs except the input $\alpha$). We can then write any function $f : \{0,1\}^n \to \{0,1\}$ as the OR of at most $2^n$ functions $\delta_\alpha$ for the $\alpha$’s on which $f(\alpha) = 1$.

⋆


Proof of Theorem 4.17. We prove the theorem for the case $m = 1$. The result can be extended for $m > 1$ as before (see also Exercise 4.9). Let $f : \{0,1\}^n \to \{0,1\}$. We will prove that there is an $O(n \cdot 2^n)$-sized Boolean circuit to compute $f$ in the following steps:

1. We show that for every $\alpha \in \{0,1\}^n$, there is an $O(n)$-sized circuit that computes the function $\delta_\alpha : \{0,1\}^n \to \{0,1\}$, where $\delta_\alpha(x) = 1$ iff $x = \alpha$.

2. We then show that this implies the existence of an $O(n \cdot 2^n)$-sized circuit that computes $f$, by writing $f(x)$ as the OR of $\delta_\alpha(x)$ for all $\alpha \in \{0,1\}^n$ such that $f(\alpha) = 1$. (If $f$ is the constant zero function and hence there is no such $\alpha$, then we can use the circuit $f(x) = x_0 \wedge \overline{x}_0$.)

Figure 4.10: Given a function $f : \{0,1\}^n \to \{0,1\}$, we let $\{x_0, x_1, \ldots, x_{N-1}\} \subseteq \{0,1\}^n$ be the set of inputs such that $f(x_i) = 1$, and note that $N \le 2^n$. We can express $f$ as the OR of $\delta_{x_i}$ for $i \in [N]$ where the function $\delta_\alpha : \{0,1\}^n \to \{0,1\}$ (for $\alpha \in \{0,1\}^n$) is defined as follows: $\delta_\alpha(x) = 1$ iff $x = \alpha$. We can compute the OR of $N$ values using $N$ two-input OR gates. Therefore if we have a circuit of size $O(n)$ to compute $\delta_\alpha$ for every $\alpha \in \{0,1\}^n$, we can compute $f$ using a circuit of size $O(n \cdot N) = O(n \cdot 2^n)$.


We start with Step 1: 
CLAIM: For a € {0,1}”, define 6, : {0,1}” as follows: 


5 ( ) 1 x=a 
alz) = : 
0 otherwise 


then there is a Boolean circuit using at most 2n gates that computes ô,- 
PROOF OF CLAIM: The proof is illustrated in Fig. 4.11. As an 
example, consider the function 6,;, : {0,1} > {0,1}. This function 
outputs 1 on x if and only if zọ = 0, x4 = 1 and z, = 1, and so we can 
write J911(@) = To A x1 A £a, which translates into a Boolean circuit 
with one NOT gate and two AND gates. More generally, for every 
a € {0,1}”, we can express 6,(@) as (£o = Ag) A (£1 = A, )^A A (£n = 
Q,,1), Where if a; = 0 we replace x, = a, with z; and if a; = 1 we 
replace x; = a, by simply «,. This yields a circuit that computes 6, 
using n AND gates and at most n NOT gates, so a total of at most 2n 
gates. 
Now for every function $f : \{0,1\}^n \to \{0,1\}$, we can write

$$f(x) = \delta_{x_0}(x) \vee \delta_{x_1}(x) \vee \cdots \vee \delta_{x_{N-1}}(x) \qquad (4.5)$$

where $S = \{x_0, \ldots, x_{N-1}\}$ is the set of inputs on which $f$ outputs 1.
(To see this, you can verify that the right-hand side of (4.5) evaluates
to 1 on $x \in \{0,1\}^n$ if and only if $x$ is in the set $S$.)

Therefore we can compute $f$ using a Boolean circuit of at most $2n$
gates for each of the $N$ functions $\delta_{x_i}$, and combine that with at most $N$
OR gates, thus obtaining a circuit of at most $2n \cdot N + N$ gates. Since
$S \subseteq \{0,1\}^n$, its size $N$ is at most $2^n$ and hence the total number of
gates in this circuit is $O(n \cdot 2^n)$.
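The two-step construction above can be sketched in Python. This is only an illustration under stated assumptions: the gate-list format and the helper names `build_circuit` and `evaluate` are hypothetical and not taken from the text, and the function `f` is assumed to accept a tuple of `n >= 1` bits.

```python
from itertools import product

def build_circuit(f, n):
    """Build an AND/OR/NOT circuit for f: {0,1}^n -> {0,1} as an OR of delta_a's.

    Wires 0..n-1 are the inputs; every gate (op, i, j) creates the next wire.
    Returns (gates, index_of_output_wire).
    """
    gates = []

    def add(op, i, j=None):
        gates.append((op, i, j))
        return n + len(gates) - 1

    terms = []
    for a in product((0, 1), repeat=n):
        if not f(a):
            continue
        # delta_a: AND of x_i (when a_i = 1) or NOT x_i (when a_i = 0)
        literals = [i if a[i] else add("NOT", i) for i in range(n)]
        wire = literals[0]
        for lit in literals[1:]:
            wire = add("AND", wire, lit)
        terms.append(wire)

    if not terms:
        # constant zero function: x_0 AND NOT x_0
        return gates, add("AND", 0, add("NOT", 0))

    out = terms[0]
    for wire in terms[1:]:
        out = add("OR", out, wire)
    return gates, out

def evaluate(gates, out, x):
    """Evaluate the gate list on the input bits x."""
    wires = list(x)
    for op, i, j in gates:
        if op == "NOT":
            wires.append(1 - wires[i])
        elif op == "AND":
            wires.append(wires[i] & wires[j])
        else:
            wires.append(wires[i] | wires[j])
    return wires[out]
```

Each term costs at most $n$ NOT gates and $n - 1$ AND gates, and the terms are joined by fewer than $N$ OR gates, so the gate list stays within the $2n \cdot N + N$ bound from the text.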


4.6 THE CLASS $SIZE_{n,m}(s)$


We have seen that every function $f : \{0,1\}^n \to \{0,1\}^m$ can be
computed by a circuit of size $O(m \cdot 2^n)$, and some functions (such as
addition and multiplication) can be computed by much smaller circuits.
We define $SIZE_{n,m}(s)$ to be the set of functions mapping $n$ bits to $m$
bits that can be computed by NAND circuits of at most $s$ gates (or
equivalently, by NAND-CIRC programs of at most $s$ lines). Formally,
the definition is as follows:


Figure 4.11: For every string $a \in \{0,1\}^n$, there is a
Boolean circuit of $O(n)$ gates to compute the function
$\delta_a : \{0,1\}^n \to \{0,1\}$ such that $\delta_a(x) = 1$ if and
only if $x = a$. The circuit is very simple. Given input
$x_0, \ldots, x_{n-1}$ we compute the AND of $z_0, \ldots, z_{n-1}$
where $z_i = x_i$ if $a_i = 1$ and $z_i = NOT(x_i)$ if $a_i = 0$.
While formally Boolean circuits only have a gate for
computing the AND of two inputs, we can implement
an AND of $n$ inputs by composing $n$ two-input
ANDs.


Definition 4.18 — Size class of functions. For all natural numbers $n, m, s$,
let $SIZE_{n,m}(s)$ denote the set of all functions $f : \{0,1\}^n \to \{0,1\}^m$
such that there exists a NAND circuit of at most $s$ gates computing
$f$. We denote by $SIZE_n(s)$ the set $SIZE_{n,1}(s)$. For every integer
$s \ge 1$, we let $SIZE(s) = \cup_{n,m} SIZE_{n,m}(s)$ be the set of all functions
$f$ for which there exists a NAND circuit of at most $s$ gates that
computes $f$.


Fig. 4.12 depicts the set $SIZE_{n,1}(s)$. Note that $SIZE_{n,m}(s)$ is a set of
functions, not of programs! Asking if a program or a circuit is a
member of $SIZE_{n,m}(s)$ is a category error in the sense of Fig. 4.13. As we
discussed in Section 3.7.2 (and Section 2.6.1), the distinction between
programs and functions is absolutely crucial. You should always
remember that while a program computes a function, it is not equal to
a function. In particular, as we’ve seen, there can be more than one
program to compute the same function.


[Image for Figure 4.12: functions $f : \{0,1\}^n \to \{0,1\}$ on one side and
programs/circuits with $n$ inputs and 1 output on the other, arranged by
number of gates/lines. Every function is computed by many circuits; every
program/circuit computes one function.]


While we defined $SIZE_n(s)$ with respect to NAND gates, we
would get essentially the same class if we defined it with respect to
AND/OR/NOT gates:


Figure 4.12: There are $2^{2^n}$ functions mapping $\{0,1\}^n$
to $\{0,1\}$, and an infinite number of circuits with $n$ bit
inputs and a single bit of output. Every circuit computes
one function, but every function can be computed
by many circuits. We say that $f \in SIZE_n(s)$
if the smallest circuit that computes $f$ has $s$ or fewer
gates. For example $XOR_n \in SIZE_n(4n)$. Theorem 4.12
shows that every function $g$ is computable
by some circuit of at most $c \cdot 2^n/n$ gates, and hence
$SIZE_n(c \cdot 2^n/n)$ corresponds to the set of all functions
from $\{0,1\}^n$ to $\{0,1\}$.


Lemma 4.19 Let $SIZE^{AON}_{n,m}(s)$ denote the set of all functions
$f : \{0,1\}^n \to \{0,1\}^m$ that can be computed by an AND/OR/NOT Boolean
circuit of at most $s$ gates. Then,

$$SIZE_{n,m}(s/2) \subseteq SIZE^{AON}_{n,m}(s) \subseteq SIZE_{n,m}(3s)$$


Proof. If $f$ can be computed by a NAND circuit of at most $s/2$ gates,
then by replacing each NAND with the two gates NOT and AND, we
can obtain an AND/OR/NOT Boolean circuit of at most $s$ gates that
computes $f$. On the other hand, if $f$ can be computed by a Boolean
AND/OR/NOT circuit of at most $s$ gates, then by Theorem 3.12 it can
be computed by a NAND circuit of at most $3s$ gates.

■
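The two gate-count factors used in the proof of Lemma 4.19 can be checked on single gates. The following is a minimal sketch with hypothetical function names; the comments count gates assuming that a repeated sub-expression is computed once and its wire is reused.

```python
def NAND(a, b):
    return 1 - (a & b)

# AND/OR/NOT expressed with NAND gates (at most 3 NAND gates each).
def not_via_nand(a):
    return NAND(a, a)                                # 1 NAND gate

def and_via_nand(a, b):
    return not_via_nand(NAND(a, b))                  # 2 NAND gates

def or_via_nand(a, b):
    return NAND(not_via_nand(a), not_via_nand(b))    # 3 NAND gates

# NAND expressed with one AND gate followed by one NOT gate (2 gates).
def nand_via_aon(a, b):
    return 1 - (a & b) if False else (0 if (a & b) else 1)
```

Checking these four identities on all inputs in $\{0,1\}^2$ confirms the replacement rules, and hence the factors 2 and 3 in the lemma.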


The results we have seen in this chapter can be phrased as showing
that $ADD_n \in SIZE_{2n,n+1}(100n)$ and $MULT_n \in SIZE_{2n,2n}(10000 n^{\log_2 3})$.
Theorem 4.12 shows that for some constant $c$, $SIZE_{n,m}(c m 2^n)$ is equal
to the set of all functions from $\{0,1\}^n$ to $\{0,1\}^m$.


Figure 4.13: A “category error” is a question such as
“is a cucumber even or odd?” which does not even
make sense. In this book one type of category error
you should watch out for is confusing functions and
programs (i.e., confusing specifications and implementations).
If $C$ is a circuit or program, then asking if
$C \in SIZE_{n,1}(s)$ is a category error, since $SIZE_{n,1}(s)$ is
a set of functions and not programs or circuits.


Solved Exercise 4.1 — SIZE closed under complement. In this exercise we
prove a certain “closure property” of the class $SIZE_n(s)$. That is, we
show that if $f$ is in this class then (up to some small additive term) so
is the complement of $f$, which is the function $g(x) = 1 - f(x)$.

Prove that there is a constant $c$ such that for every $f : \{0,1\}^n \to
\{0,1\}$ and $s \in \mathbb{N}$, if $f \in SIZE_n(s)$ then $1 - f \in SIZE_n(s + c)$.


Solution:

If $f \in SIZE_n(s)$ then there is an $s$-line NAND-CIRC program
$P$ that computes $f$. We can rename the variable Y[0] in $P$ to a
variable temp and add the line


Y[0] = NAND(temp,temp)


at the very end to obtain a program $P'$ that computes $1 - f$.
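The renaming trick in this solution can be tried out with a small Python sketch. The helpers `complement` and `run` and the fresh variable name `temp_out` are hypothetical choices made for illustration; the sketch assumes a single-output program whose lines all have the form `foo = NAND(bar,blah)`.

```python
def NAND(a, b):
    return 1 - a * b

def complement(program):
    """Return NAND-CIRC source computing 1 - f, given source computing f."""
    assert "temp_out" not in program  # the fresh name must not clash
    lines = [ln.replace("Y[0]", "temp_out")
             for ln in program.strip().splitlines()]
    lines.append("Y[0] = NAND(temp_out,temp_out)")
    return "\n".join(lines)

def run(program, x):
    """Evaluate a single-output NAND-CIRC program on the input bits x."""
    env = {}

    def val(v):
        return x[int(v[2:-1])] if v.startswith("X[") else env[v]

    for ln in program.strip().splitlines():
        target, expr = (s.strip() for s in ln.split("="))
        a, b = (s.strip() for s in expr[len("NAND("):-1].split(","))
        env[target] = NAND(val(a), val(b))
    return env["Y[0]"]

XOR = """
u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)
"""
```

Here `complement(XOR)` has exactly one more line than `XOR`, which matches the additive constant $c = 1$ in the solution.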
4.7 EXERCISES


Exercise 4.1 — Pairing. This exercise asks you to give a one-to-one map
from $\mathbb{N}^2$ to $\mathbb{N}$. This can be useful to implement two-dimensional arrays
as “syntactic sugar” in programming languages that only have
one-dimensional arrays.


1. Prove that the map $F(x,y) = 2^x 3^y$ is a one-to-one map from $\mathbb{N}^2$ to
$\mathbb{N}$.


2. Show that there is a one-to-one map $F : \mathbb{N}^2 \to \mathbb{N}$ such that for every
$x, y$, $F(x,y) \le 100 \cdot \max\{x,y\}^2 + 100$.


3. For every $k$, show that there is a one-to-one map $F : \mathbb{N}^k \to \mathbb{N}$ such
that for every $x_0, \ldots, x_{k-1} \in \mathbb{N}$,
$F(x_0, \ldots, x_{k-1}) \le 100 \cdot (x_0 + x_1 + \cdots + x_{k-1} + 100k)^k$.


Exercise 4.2 — Computing MUX. Prove that the NAND-CIRC program
below computes the function $MUX$ (or $LOOKUP_1$) where $MUX(a, b, c)$
equals $a$ if $c = 0$ and equals $b$ if $c = 1$:


t = NAND(X[2],X[2])
u = NAND(X[0],t)
v = NAND(X[1],X[2])
Y[0] = NAND(u,v)


Exercise 4.3 — At least two / Majority. Give a NAND-CIRC program of at
most 6 lines to compute the function $MAJ : \{0,1\}^3 \to \{0,1\}$ where
$MAJ(a,b,c) = 1$ iff $a + b + c \ge 2$.


Exercise 4.4 — Conditional statements. In this exercise we will explore
Theorem 4.6: transforming NAND-CIRC-IF programs that use code such
as if .. then .. else .. to standard NAND-CIRC programs.


1. Give a “proof by code” of Theorem 4.6: a program in a programming
language of your choice that transforms a NAND-CIRC-IF
program $P$ into a “sugar-free” NAND-CIRC program $P'$ that
computes the same function. See footnote for hint.⁴


2. Prove the following statement, which is the heart of Theorem 4.6:
suppose that there exists an $s$-line NAND-CIRC program to
compute $f : \{0,1\}^n \to \{0,1\}$ and an $s'$-line NAND-CIRC program
to compute $g : \{0,1\}^n \to \{0,1\}$. Prove that there exists a
NAND-CIRC program of at most $s + s' + 10$ lines to compute the
function $h : \{0,1\}^{n+1} \to \{0,1\}$ where $h(x_0, \ldots, x_{n-1}, x_n)$ equals
$f(x_0, \ldots, x_{n-1})$ if $x_n = 0$ and equals $g(x_0, \ldots, x_{n-1})$ otherwise.
(All programs in this item are standard “sugar-free” NAND-CIRC
programs.)


Exercise 4.5 — Half and full adders.

1. A half adder is the function $HA : \{0,1\}^2 \to \{0,1\}^2$ that
corresponds to adding two binary bits. That
is, for every $a, b \in \{0,1\}$, $HA(a,b) = (e,f)$ where $2e + f = a + b$.
Prove that there is a NAND circuit of at most five NAND gates that
computes $HA$.


2. A full adder is the function $FA : \{0,1\}^3 \to \{0,1\}^2$ that takes in
two bits and a “carry” bit and outputs their sum. That is, for every
$a, b, c \in \{0,1\}$, $FA(a,b,c) = (e,f)$ such that $2e + f = a + b + c$.
Prove that there is a NAND circuit of at most nine NAND gates that
computes $FA$.


3. Prove that if there is a NAND circuit of $c$ gates that computes $FA$,
then there is a circuit of $cn$ gates that computes $ADD_n$ where (as
in Theorem 4.7) $ADD_n : \{0,1\}^{2n} \to \{0,1\}^{n+1}$ is the function that
outputs the addition of two input $n$-bit numbers. See footnote for
hint.⁵


4. Show that for every $n$ there is a NAND-CIRC program to compute
$ADD_n$ with at most $9n$ lines.


4 You can start by transforming $P$ into a NAND-CIRC-PROC
program that uses procedure statements, and
then use the code of Fig. 4.3 to transform the latter
into a “sugar-free” NAND-CIRC program.

5 Use a “cascade” of adding the bits one after the
other, starting with the least significant digit, just like
in the elementary-school algorithm.


Exercise 4.6 — Addition. Write a program using your favorite programming
language that on input of an integer $n$, outputs a NAND-CIRC
program that computes $ADD_n$. Can you ensure that the program it
outputs for $ADD_n$ has fewer than $10n$ lines?


Exercise 4.7 — Multiplication. Write a program using your favorite
programming language that on input of an integer $n$, outputs a
NAND-CIRC program that computes $MULT_n$. Can you ensure that the
program it outputs for $MULT_n$ has fewer than $1000 \cdot n^2$ lines?


Exercise 4.8 — Efficient multiplication (challenge). Write a program using
your favorite programming language that on input of an integer $n$,
outputs a NAND-CIRC program that computes $MULT_n$ and has at
most $10000 n^{1.9}$ lines. What is the smallest number of lines you can
use to multiply two 2048 bit numbers?


Exercise 4.9 — Multibit function. In the text Theorem 4.12 is only proven
for the case $m = 1$. In this exercise you will extend the proof for every
$m$.

Prove that


1. If there is an $s$-line NAND-CIRC program to compute
$f : \{0,1\}^n \to \{0,1\}$ and an $s'$-line NAND-CIRC program
to compute $f' : \{0,1\}^n \to \{0,1\}$ then there is an $(s + s')$-line
program to compute the function $g : \{0,1\}^n \to \{0,1\}^2$ such that
$g(x) = (f(x), f'(x))$.


2. For every function $f : \{0,1\}^n \to \{0,1\}^m$, there is a NAND-CIRC
program of at most $10m \cdot 2^n$ lines that computes $f$. (You can use the
$m = 1$ case of Theorem 4.12, as well as Item 1.)


6 Hint for Exercise 4.8: Use Karatsuba’s algorithm.


Exercise 4.10 — Simplifying using syntactic sugar. Let $P$ be the following
NAND-CIRC program:


Temp[0] = NAND(X[0],X[0])
Temp[1] = NAND(X[1],X[1])
Temp[2] = NAND(Temp[0],Temp[1])
Temp[3] = NAND(X[2],X[2])
Temp[4] = NAND(X[3],X[3])
Temp[5] = NAND(Temp[3],Temp[4])
Temp[6] = NAND(Temp[2],Temp[2])
Temp[7] = NAND(Temp[5],Temp[5])
Y[0] = NAND(Temp[6],Temp[7])


1. Write a program $P'$ with at most three lines of code that uses both
NAND as well as the syntactic sugar OR that computes the same
function as $P$.


2. Draw a circuit that computes the same function as P and uses only 
AND and NOT gates. 


In the following exercises you are asked to compare the power of 
pairs of programming languages. By “comparing the power” of two 
programming languages X and Y we mean determining the relation 
between the set of functions that are computable using programs in X 
and Y respectively. That is, to answer such a question you need to do 
both of the following: 


1. Either prove that for every program P in X there is a program P’ 
in Y that computes the same function as P, or give an example for 
a function that is computable by an X-program but not computable 
by a Y-program. 


and 


2. Either prove that for every program P in Y there is a program P’ 
in X that computes the same function as P, or give an example for a 
function that is computable by a Y-program but not computable by 
an X-program. 


When you give an example as above of a function that is com- 
putable in one programming language but not the other, you need 
to prove that the function you showed is (1) computable in the first 
programming language and (2) not computable in the second program- 
ming language. 

Exercise 4.11 — Compare IF and NAND. Let IF-CIRC be the programming
language where we have the following operations foo = 0, foo = 1,
foo = IF(cond,yes,no) (that is, we can use the constants 0 and 1,
and the $IF : \{0,1\}^3 \to \{0,1\}$ function such that $IF(a,b,c)$ equals $b$ if
$a = 1$ and equals $c$ if $a = 0$). Compare the power of the NAND-CIRC
programming language and the IF-CIRC programming language.


Exercise 4.12 — Compare XOR and NAND. Let XOR-CIRC be the
programming language where we have the following operations
foo = XOR(bar,blah), foo = 1 and bar = 0 (that is, we can use the
constants 0, 1 and the $XOR$ function that maps $a, b \in \{0,1\}^2$ to $a + b
\mod 2$). Compare the power of the NAND-CIRC programming
language and the XOR-CIRC programming language. See footnote for
hint.⁷


Exercise 4.13 — Circuits for majority. Prove that there is some constant $c$
such that for every $n > 1$, $MAJ_n \in SIZE_n(cn)$ where $MAJ_n : \{0,1\}^n \to
\{0,1\}$ is the majority function on $n$ input bits. That is $MAJ_n(x) = 1$ iff
$\sum_{i=0}^{n-1} x_i > n/2$. See footnote for hint.⁸


8 One approach to solve this is using recursion and the
so-called Master Theorem.


Exercise 4.14 — Circuits for threshold. Prove that there is some constant $c$
such that for every $n > 1$, and integers $a_0, \ldots, a_{n-1}, b \in \{-2^n, -2^n +
1, \ldots, -1, 0, +1, \ldots, 2^n\}$, there is a NAND circuit with at most $n^c$ gates
that computes the threshold function
$f_{a_0, \ldots, a_{n-1}, b} : \{0,1\}^n \to \{0,1\}$ that
on input $x \in \{0,1\}^n$ outputs 1 if and only if $\sum_{i=0}^{n-1} a_i x_i > b$.


4.8 BIBLIOGRAPHICAL NOTES 


See Jukna’s and Wegener’s books [Juk12; Weg87] for much more 
extensive discussion on circuits. Shannon showed that every Boolean 
function can be computed by a circuit of exponential size [Sha38]. The 
improved bound of $c \cdot 2^n/n$ (with the optimal value of $c$ for many
bases) is due to Lupanov [Lup58]. An exposition of this for the case
of NAND (where $c = 1$) is given in Chapter 4 of his book [Lup84].
(Thanks to Sasha Golovnev for tracking down this reference!) 

The concept of “syntactic sugar” is also known as “macros” or 
“meta-programming” and is sometimes implemented via a prepro- 
cessor or macro language in a programming language or a text editor. 
One modern example is the Babel JavaScript syntax transformer, which
converts JavaScript programs written using the latest features into
a format that older browsers can accept. It even has a plug-in
architecture that allows users to add their own syntactic sugar to the
language.


5
Code as data, data as code


Learning Objectives:

• See one of the most important concepts in
computing: duality between code and data.

• Build up comfort in moving between
different representations of programs.

• Follow the construction of a “universal circuit
evaluator” that can evaluate other circuits
given their representation.

• See a major result that complements the result
of the last chapter: some functions require an
exponential number of gates to compute.

• Discussion of the Physical Extended
Church-Turing Thesis, stating that Boolean
circuits capture all feasible computation in
the physical world, and its physical and
philosophical implications.

“The term code script is, of course, too narrow. The chromosomal structures 
are at the same time instrumental in bringing about the development they 
foreshadow. They are law-code and executive power - or, to use another simile, 
they are architect's plan and builder's craft - in one.” , Erwin Schrödinger, 
1944. 


“A mathematician would hardly call a correspondence between the set of 64
triples of four units and a set of twenty other units, “universal”, while such
correspondence is, probably, the most fundamental general feature of life on
Earth”, Misha Gromov, 2013


A program is simply a sequence of symbols, each of which can be 
encoded as a string of 0’s and 1’s using (for example) the ASCII stan- 
dard. Therefore we can represent every NAND-CIRC program (and 
hence also every Boolean circuit) as a binary string. This statement 
seems obvious but it is actually quite profound. It means that we can
treat circuits or NAND-CIRC programs both as instructions for carrying
out computation and also as data that could potentially be used as
inputs to other computations.


This correspondence between code and data is one of the most fun- 
damental aspects of computing. It underlies the notion of general 
purpose computers, that are not pre-wired to compute only one task, 
and also forms the basis of our hope for obtaining general artificial 
intelligence. This concept finds immense use in all areas of comput- 
ing, from scripting languages to machine learning, but it is fair to say 
that we haven't yet fully mastered it. Many security exploits involve 
cases such as “buffer overflows” when attackers manage to inject code 
where the system expected only “passive” data (see Fig. 5.1). The
relation between code and data reaches beyond the realm of electronic
computers. For example, DNA can be thought of as both a program
and data (in the words of Schrödinger, who wrote before the discovery
of DNA’s structure a book that inspired Watson and Crick, DNA is
both “architect’s plan and builder’s craft”).


Figure 5.1: As illustrated in this xkcd cartoon, many
exploits, including buffer overflow, SQL injections,
and more, utilize the blurry line between “active
programs” and “static strings”.


Figure 5.2: Overview of the results in this chapter.
We use the representation of programs/circuits as
strings to derive two main results. First we show
the existence of a universal program/circuit, and
in fact (with more work) the existence of such a
program/circuit whose size is at most polynomial in
the size of the program/circuit it evaluates. We then
use the string representation to count the number
of programs/circuits of a given size, and use that to
establish that some functions require an exponential
number of lines/gates to compute.




5.1 REPRESENTING PROGRAMS AS STRINGS 


We can represent programs or circuits as strings in a myriad of ways. 
For example, since Boolean circuits are labeled directed acyclic graphs, 
we can use the adjacency matrix or adjacency list representations for 
them. However, since the code of a program is ultimately just a se- 
quence of letters and symbols, arguably the conceptually simplest 
representation of a program is as such a sequence. For example, the 
following NAND-CIRC program P 


temp_0 = NAND(X[0],X[1])
temp_1 = NAND(X[0],temp_0)
temp_2 = NAND(X[1],temp_0)
Y[0] = NAND(temp_1,temp_2)


is simply a string of 107 symbols which include lower and upper
case letters, digits, the underscore character _ and equality sign =,
punctuation marks such as “(”, “)”, “,”, spaces, and “new line”
markers (often denoted as “\n” or “↵”). Each such symbol can be encoded
as a string of 7 bits using the ASCII encoding, and hence the program
$P$ can be encoded as a string of length $7 \cdot 107 = 749$ bits.
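To make the encoding concrete, here is a minimal Python sketch that turns the source text of a program into a bit string with 7 bits per character and back. The function names `to_bits` and `from_bits` are hypothetical. The exact number of symbols depends on how spaces and the final newline are written, so the length printed here need not match the count quoted above.

```python
P = (
    "temp_0 = NAND(X[0],X[1])\n"
    "temp_1 = NAND(X[0],temp_0)\n"
    "temp_2 = NAND(X[1],temp_0)\n"
    "Y[0] = NAND(temp_1,temp_2)\n"
)

def to_bits(source):
    """Encode every character of the source as 7 bits (ASCII)."""
    return "".join(format(ord(ch), "07b") for ch in source)

def from_bits(bits):
    """Decode a string of 7-bit blocks back into text."""
    return "".join(chr(int(bits[i:i + 7], 2)) for i in range(0, len(bits), 7))

bits = to_bits(P)
print(len(P), "symbols,", len(bits), "bits")
```

Decoding with `from_bits(bits)` recovers the program text exactly, which is the sense in which the string is a faithful representation of the program.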

Nothing in the above discussion was specific to the program $P$, and
hence we can use the same reasoning to prove that every NAND-CIRC
program can be represented as a string in $\{0,1\}^*$. In fact, we can do a
bit better. Since the names of the working variables of a NAND-CIRC
program do not affect its functionality, we can always transform a
program into an equivalent program $P'$ in which all variables apart from
the inputs and outputs have the form temp_0, temp_1, temp_2, etc. Moreover,
if the program has $s$ lines, then we will never need to use an index
larger than $3s$ (since each line involves at most three variables), and
similarly the indices of the input and output variables will all be at
most $3s$. Since a number between 0 and $3s$ can be expressed using
at most $\lceil \log_{10}(3s + 1) \rceil = O(\log s)$ digits, each line in the program
(which has the form foo = NAND(bar,blah)), can be represented
using $O(1) + O(\log s) = O(\log s)$ symbols, each of which can be
represented by 7 bits. Hence an $s$ line program can be represented as a
string of $O(s \log s)$ bits, resulting in the following theorem:


Theorem 5.1 — Representing programs as strings. There is a constant $c$
such that for every $f \in SIZE(s)$, there exists a program $P$ computing $f$
whose string representation has length at most $c s \log s$.


5.2 COUNTING PROGRAMS, AND LOWER BOUNDS ON THE SIZE
OF NAND-CIRC PROGRAMS


One consequence of the representation of programs as strings is that
the number of programs of certain length is bounded by the number
of strings that represent them. This has consequences for the sets
$SIZE_{n,m}(s)$ that we defined in Section 4.6.


Theorem 5.2 — Counting programs. For every $s, n, m \in \mathbb{N}$,

$$|SIZE_{n,m}(s)| \le 2^{O(s \log s)} .$$

That is, there are at most $2^{O(s \log s)}$ functions computed by
NAND-CIRC programs of at most $s$ lines.¹


Proof. For any $n, m \in \mathbb{N}$, we will show a one-to-one map $E$ from
$SIZE_{n,m}(s)$ to the set of strings of length at most $c s \log s$ for some
constant $c$. This will conclude the proof, since it implies that
$|SIZE_{n,m}(s)|$ is smaller than the size of the set of all strings of length
at most $\ell = c s \log s$. The size of the latter set is
$1 + 2 + 4 + \cdots + 2^\ell = 2^{\ell+1} - 1$ by the
formula for sums of geometric progressions.

The map $E$ will simply map $f$ to the representation of the smallest
program computing $f$. Since $f \in SIZE_{n,m}(s)$, there is a program $P$
of at most $s$ lines that can be represented using a string of length at
most $c s \log s$ by Theorem 5.1. Moreover, the map $f \mapsto E(f)$ is one to
one, since for every distinct $f, f' : \{0,1\}^n \to \{0,1\}^m$ there must exist
some input $x \in \{0,1\}^n$ on which $f(x) \ne f'(x)$. This means that the
programs that compute $f$ and $f'$ respectively cannot be identical.


1 The implicit constant in the $O(\cdot)$ notation is
smaller than 10. That is, for all sufficiently large $s$,
$|SIZE_{n,m}(s)| < 2^{10 s \log s}$, see Remark 5.4. As discussed
in Section 1.7, we use the bound 10 simply because it
is a round number.


Theorem 5.2 has an important corollary. The number of functions
that can be computed using small circuits/programs is much
smaller than the total number of functions, and hence there exist
functions that require very large (in fact exponentially large) circuits
to compute. To see why this is the case, note that a function
mapping $\{0,1\}^2$ to $\{0,1\}$ can be identified with the list of its four
values on the inputs 00, 01, 10, 11. A function mapping $\{0,1\}^3$ to
$\{0,1\}$ can be identified with the list of its eight values on the inputs
000, 001, 010, 011, 100, 101, 110, 111. More generally, every function
$F : \{0,1\}^n \to \{0,1\}$ can be identified with the list of its $2^n$ values on
the inputs $\{0,1\}^n$. Hence the number of functions mapping $\{0,1\}^n$
to $\{0,1\}$ is equal to the number of possible $2^n$ length lists of values
which is exactly $2^{2^n}$. Note that this is double exponential in $n$, and hence
even for small values of $n$ (e.g., $n = 10$) the number of functions from
$\{0,1\}^n$ to $\{0,1\}$ is truly astronomical.² As mentioned, this yields the
following corollary:


Theorem 5.3 — Counting argument lower bound. There is a constant
$\delta > 0$, such that for every sufficiently large $n$, there is a function
$f : \{0,1\}^n \to \{0,1\}$ such that $f \notin SIZE_n\left(\frac{\delta 2^n}{n}\right)$. That is, the shortest
NAND-CIRC program to compute $f$ requires more than $\delta \cdot 2^n/n$
lines.³


Proof. The proof is simple. If we let $c$ be the constant such that
$|SIZE_n(s)| < 2^{c s \log s}$ and $\delta = 1/c$, then setting $s = \delta 2^n/n$ we see that

$$\left|SIZE_n\left(\tfrac{\delta 2^n}{n}\right)\right| < 2^{c \frac{\delta 2^n}{n} \log s} < 2^{c \delta 2^n} = 2^{2^n}$$

using the fact that since $s < 2^n$, $\log s < n$ and $\delta = 1/c$. But since
$|SIZE_n(s)|$ is smaller than the total number of functions mapping $n$
bits to 1 bit, there must be at least one such function not in $SIZE_n(s)$,
which is what we needed to prove.

■
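The inequality in this proof can be checked numerically by comparing exponents rather than the huge numbers themselves. The sketch below is illustrative only: it assumes the constant $c = 10$ from the footnote to Theorem 5.2 (stated there for sufficiently large $s$) and $\delta = 0.1$, and the function names are hypothetical.

```python
import math

def log2_num_functions(n):
    """log2 of the number of functions from {0,1}^n to {0,1}, namely 2^n."""
    return 2 ** n

def log2_program_bound(s, c=10):
    """log2 of the bound 2^(c * s * log2(s)) on |SIZE_n(s)| (c = 10 is assumed)."""
    return c * s * math.log2(s)

for n in (10, 20, 30):
    s = 0.1 * 2 ** n / n          # s = delta * 2^n / n with delta = 1/c
    print(n, log2_program_bound(s) < log2_num_functions(n))
```

With these choices $c \cdot s \cdot \log s = 2^n \cdot (\log s)/n$, which is below $2^n$ because $\log s < n$, so the comparison comes out in favor of the number of functions for every $n$ tried.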


We have seen before that every function mapping $\{0,1\}^n$ to $\{0,1\}$
can be computed by an $O(2^n/n)$ line program. Theorem 5.3 shows
that this is tight in the sense that some functions do require such an
astronomical number of lines to compute.

In fact, as we explore in the exercises, this is the case for most
functions. Hence functions that can be computed in a small number of
lines (such as addition, multiplication, finding short paths in graphs,
or even the EVAL function) are the exception, rather than the rule.


2 “Astronomical” here is an understatement: there are
far fewer than $2^{2^{10}}$ stars, or even particles, in the
observable universe.

3 The constant $\delta$ is at least 0.1 and in fact can be
improved to be arbitrarily close to 1/2, see Exercise 5.7.


5.2.1 Size hierarchy theorem (optional)

By Theorem 4.15 the class $SIZE_n(10 \cdot 2^n/n)$ contains all functions
from $\{0,1\}^n$ to $\{0,1\}$, while by Theorem 5.3, there is some function
$f : \{0,1\}^n \to \{0,1\}$ that is not contained in $SIZE_n(0.1 \cdot 2^n/n)$. In other
words, for every sufficiently large $n$,

$$SIZE_n\left(0.1 \tfrac{2^n}{n}\right) \subsetneq SIZE_n\left(10 \tfrac{2^n}{n}\right) .$$

It turns out that we can use Theorem 5.3 to show a more general
result: whenever we increase our “budget” of gates we can compute
new functions.


Theorem 5.5 — Size Hierarchy Theorem. For every sufficiently large $n$
and $10n < s < 0.1 \cdot 2^n/n$,

$$SIZE_n(s) \subsetneq SIZE_n(s + 10n) .$$


Proof Idea:

To prove the theorem we need to find a function $f : \{0,1\}^n \to \{0,1\}$
such that $f$ can be computed by a circuit of $s + 10n$ gates but it cannot
be computed by a circuit of $s$ gates. We will do so by coming up with
a sequence of functions $f_0, f_1, f_2, \ldots, f_N$ with the following properties:
(1) $f_0$ can be computed by a circuit of at most $10n$ gates, (2) $f_N$ cannot
be computed by a circuit of $0.1 \cdot 2^n/n$ gates, and (3) for every
$i \in \{0, \ldots, N-1\}$, if $f_i$ can be computed by a circuit of size $s$, then
$f_{i+1}$ can be computed by a circuit of size at most $s + 10n$. Together these
properties imply that if we let $i$ be the smallest number such that
$f_i \notin SIZE_n(s)$, then since $f_{i-1} \in SIZE_n(s)$ it must hold that
$f_i \in SIZE_n(s + 10n)$ which is what we need to prove. See Fig. 5.4 for an
illustration.


Proof of Theorem 5.5. Let $f^* : \{0,1\}^n \to \{0,1\}$ be the function
(whose existence we are guaranteed by Theorem 5.3) such that
$f^* \notin SIZE_n(0.1 \cdot 2^n/n)$. We define the functions $f_0, f_1, \ldots, f_{2^n}$
mapping $\{0,1\}^n$ to $\{0,1\}$ as follows. For every $x \in \{0,1\}^n$, if
$lex(x) \in \{0, 1, \ldots, 2^n - 1\}$ is $x$’s order in the lexicographical order then

$$f_i(x) = \begin{cases} f^*(x) & lex(x) < i \\ 0 & \text{otherwise} \end{cases} .$$

The function $f_0$ is simply the constant zero function, while the
function $f_{2^n}$ is equal to $f^*$. Moreover, for every $i \in [2^n]$, the functions
$f_i$ and $f_{i+1}$ differ on at most one input (i.e., the input $x \in \{0,1\}^n$ such
that $lex(x) = i$). Let $10n < s < 0.1 \cdot 2^n/n$, and let $i$ be the first index
such that $f_i \notin SIZE_n(s)$. Since $f_{2^n} = f^* \notin SIZE_n(0.1 \cdot 2^n/n)$ there
must exist such an index $i$, and moreover $i > 0$ since the constant zero
function is a member of $SIZE_n(10n)$.

By our choice of $i$, $f_{i-1}$ is a member of $SIZE_n(s)$. To complete the
proof, we need to show that $f_i \in SIZE_n(s + 10n)$. Let $x^*$ be the string
such that $lex(x^*) = i - 1$ (the only input on which $f_{i-1}$ and $f_i$ can
differ) and let $b \in \{0,1\}$ be the value of $f^*(x^*)$. Then we
can define $f_i$ also as follows

$$f_i(x) = \begin{cases} b & x = x^* \\ f_{i-1}(x) & x \ne x^* \end{cases}$$

or in other words

$$f_i(x) = \left( f_{i-1}(x) \wedge \neg EQUAL(x^*, x) \right) \vee \left( b \wedge EQUAL(x^*, x) \right)$$

where $EQUAL : \{0,1\}^{2n} \to \{0,1\}$ is the function that maps $x, x' \in
\{0,1\}^n$ to 1 if they are equal and to 0 otherwise. Since (by our choice
of $i$), $f_{i-1}$ can be computed using at most $s$ gates and (as can be easily
verified) $EQUAL \in SIZE_{2n}(9n)$, we can compute $f_i$ using at most
$s + 9n + O(1) \le s + 10n$ gates which is what we wanted to prove.

■


Figure 5.4: We prove Theorem 5.5 by coming up
with a list $f_0, \ldots, f_{2^n}$ of functions such that $f_0$ is the
all zero function, $f_{2^n}$ is a function (obtained from
Theorem 5.3) outside of $SIZE_n(0.1 \cdot 2^n/n)$ and such
that $f_{i-1}$ and $f_i$ differ from one another on at most one
input. We can show that for every $i$, the number of
gates to compute $f_i$ is at most $10n$ larger than the
number of gates to compute $f_{i-1}$ and so if we let $i$ be
the smallest number such that $f_i \notin SIZE_n(s)$, then
$f_i \in SIZE_n(s + 10n)$.
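The chain of functions used in this proof can be explored on truth tables. The sketch below is a hedged illustration with hypothetical helper names: a function on $n$ bits is stored as a list of $2^n$ bits indexed by lexicographic order, `hybrids` builds $f_0, \ldots, f_{2^n}$, and `patch` applies the EQUAL-based formula that changes a function on a single input. Since $f_i$ and $f_{i+1}$ differ only at position $i$, the step from $f_{i-1}$ to $f_i$ touches position $i - 1$.

```python
def hybrids(f_star):
    """Return [f_0, ..., f_{2^n}] as truth tables.

    f_i agrees with f_star on the first i inputs (in lexicographic order)
    and is 0 on all remaining inputs.
    """
    size = len(f_star)
    return [f_star[:i] + [0] * (size - i) for i in range(size + 1)]

def patch(prev, x_star, b):
    """Truth table of x -> (prev(x) AND NOT EQUAL(x*, x)) OR (b AND EQUAL(x*, x))."""
    table = []
    for x, p in enumerate(prev):
        equal = int(x == x_star)
        table.append((p & (1 - equal)) | (b & equal))
    return table

f_star = [1, 0, 1, 1, 0, 0, 1, 0]   # an arbitrary function on 3 bits
fs = hybrids(f_star)
```

In this representation `patch(fs[i - 1], i - 1, f_star[i - 1])` equals `fs[i]` for every `i >= 1`, which mirrors the step of the proof that adds only about $10n$ gates.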


[Image for Figure 5.5: nested size classes inside
$ALL_{n,n} = \{ f \mid f : \{0,1\}^n \to \{0,1\}^n \} = SIZE_{n,n}(10 \cdot 2^n)$, where the
equality follows from universality of NAND circuits. The set
$ALL_{n,n} \setminus SIZE_{n,n}(0.1 \cdot 2^n/n)$ is not empty by the counting lower bound,
and the gaps between successive classes are not empty by the size
hierarchy theorem. The placement of the specific problems is based on the
best-known algorithms for them; we don’t know what their true complexity
is, and it could be that all three problems are in $SIZE_{n,n}(100n)$.]


Figure 5.5: An illustration of some of what we know
about the size complexity classes (not to scale!). This
figure depicts classes of the form $SIZE_{n,n}(s)$ but the
state of affairs for other size complexity classes such
as $SIZE_{n,1}(s)$ is similar. We know by Theorem 4.12
(with the improvement of Section 4.4.2) that all
functions mapping $n$ bits to $n$ bits can be computed
by a circuit of size $c \cdot 2^n$ for $c \le 10$, while on the
other hand the counting lower bound (Theorem 5.3,
see also Exercise 5.4) shows that some such functions
will require $0.1 \cdot 2^n$, and the size hierarchy theorem
(Theorem 5.5) shows the existence of functions in
$SIZE_n(S) \setminus SIZE_n(s)$ whenever $s = o(S)$, see also
Exercise 5.5. We also consider some specific examples:
addition of two $n/2$ bit numbers can be done in $O(n)$
lines, while we don’t know of such a program for
multiplying two $n$ bit numbers, though we do know
it can be done in $O(n^2)$ and in fact even better size.
In the above, $FACTOR_n$ corresponds to the inverse
problem of multiplying: finding the prime factorization
of a given number. At the moment we do not know
of any circuit with a polynomial (or even sub-exponential)
number of lines that can compute $FACTOR_n$.
5.3 THE TUPLES REPRESENTATION


ASCII is a fine presentation of programs, but for some applications
it is useful to have a more concrete representation of NAND-CIRC
programs. In this section we describe a particular choice that will
be convenient for us later on. A NAND-CIRC program is simply a
sequence of lines of the form


blah = NAND(baz,boo)


There is of course nothing special about the particular names we
use for variables. Although they would be harder to read, we could
write all our programs using only working variables such as temp_0,
temp_1 etc. Therefore, our representation for NAND-CIRC programs
ignores the actual names of the variables, and just associates a number
with each variable. We encode a line of the program as a triple of
numbers. If the line has the form foo = NAND(bar,blah) then we
encode it with the triple $(i, j, k)$ where $i$ is the number corresponding
to the variable foo and $j$ and $k$ are the numbers corresponding to bar
and blah respectively.

More concretely, we will associate every variable with a number
in the set $[t] = \{0, 1, \ldots, t - 1\}$. The first $n$ numbers $\{0, \ldots, n - 1\}$
correspond to the input variables, the last $m$ numbers $\{t - m, \ldots, t - 1\}$
correspond to the output variables, and the intermediate numbers
$\{n, \ldots, t - m - 1\}$ correspond to the remaining “workspace” variables.
Formally, we define our representation as follows:


Definition 5.7 — List of tuples representation. Let $P$ be a NAND-CIRC
program of $n$ inputs, $m$ outputs, and $s$ lines, and let $t$ be the
number of distinct variables used by $P$. The list of tuples representation
of $P$ is the triple $(n, m, L)$ where $L$ is a list of triples of the form
$(i, j, k)$ for $i, j, k \in [t]$.

We assign a number for each variable of $P$ as follows:


• For every $i \in [n]$, the variable X[i] is assigned the number $i$.


• For every $j \in [m]$, the variable Y[j] is assigned the number
$t - m + j$.


• Every other variable is assigned a number in $\{n, n+1, \ldots, t - m -
1\}$ in the order in which the variable appears in the program $P$.


The list of tuples representation is our default choice for represent- 
ing NAND-CIRC programs. Since “list of tuples representation” is a 
bit of a mouthful, we will often call it simply “the representation” for 
a program P. Sometimes, when the number n of inputs and number 
m of outputs are known from the context, we will simply represent a 
program as the list L instead of the triple (n, m, L). 


Example 5.8 — Representing the XOR program. Our favorite NAND-CIRC
program, the program

u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)

computing the XOR function is represented as the tuple (2, 1, L)
where L = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)). That is, the variables
X[0] and X[1] are given the indices 0 and 1 respectively, the variables
u, v, w are given the indices 2, 3, 4 respectively, and the variable
Y[0] is given the index 5.


Transforming a NAND-CIRC program from its representation as 
code to the representation as a list of tuples is a fairly straightforward 
programming exercise, and in particular can be done in a few lines of 
Python.⁴ The list-of-tuples representation loses information such as the
particular names we used for the variables, but this is OK since these 
names do not make a difference to the functionality of the program. 
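
To make the transformation concrete, here is a minimal sketch of one way to
do it. This is not the code from the repository mentioned in the text: the
function name program_to_tuples and the regular expression are illustrative
assumptions, and "order of appearance" is taken to mean first appearance
when reading each line left to right.

```python
import re

# One NAND-CIRC line: target = NAND(arg1, arg2)
LINE = re.compile(r"^\s*([^\s=]+)\s*=\s*NAND\(\s*([^,\s]+)\s*,\s*([^)\s]+)\s*\)\s*$")

def program_to_tuples(code):
    """Convert NAND-CIRC source text into the (n, m, L) representation."""
    lines = []
    for raw in code.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        match = LINE.match(raw)
        if match is None:
            raise ValueError(f"not a NAND-CIRC line: {raw!r}")
        lines.append(match.groups())

    names = [name for line in lines for name in line]

    def count(prefix):
        # Number of inputs (prefix "X") or outputs (prefix "Y"): largest index + 1.
        idx = [int(v[2:-1]) for v in names if v.startswith(prefix + "[")]
        return max(idx) + 1 if idx else 0

    n, m = count("X"), count("Y")

    # Workspace variables, in order of first appearance.
    workspace = []
    for v in names:
        if not v.startswith(("X[", "Y[")) and v not in workspace:
            workspace.append(v)
    t = n + len(workspace) + m

    def index(v):
        if v.startswith("X["):
            return int(v[2:-1])            # inputs get 0..n-1
        if v.startswith("Y["):
            return t - m + int(v[2:-1])    # outputs get t-m..t-1
        return n + workspace.index(v)      # workspace gets n..t-m-1

    L = tuple((index(a), index(b), index(c)) for (a, b, c) in lines)
    return n, m, L

xor = """u = NAND(X[0],X[1])
v = NAND(X[0],u)
w = NAND(X[1],u)
Y[0] = NAND(v,w)"""
print(program_to_tuples(xor))
# (2, 1, ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)))
```

On the XOR program this reproduces the representation of Example 5.8.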


5.3.1 From tuples to strings 
If P is a program of size s, then the number t of variables is at most 3s
(as every line touches at most three variables). Hence we can encode
every variable index in [t] as a string of length ℓ = ⌈log(3s)⌉, by adding
leading zeroes as needed. Since this is a fixed-length encoding, it is
prefix free, and so we can encode the list L of s triples (corresponding
to the encoding of the s lines of the program) as simply the string of
length 3ℓs obtained by concatenating all of these encodings.

We define S(s) to be the length of the string representing the list L
corresponding to a size s program. By the above we see that

S(s) = 3s⌈log(3s)⌉ . (5.1)
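
The fixed-length encoding above can be sketched in a few lines of Python.
The helper names encode_list and decode_list are illustrative assumptions,
not names used elsewhere in the book, and logarithms are taken base 2.

```python
def index_length(s):
    # ceil(log2(3s)) computed with integers only, valid for s >= 1
    return (3 * s - 1).bit_length()

def encode_list(L):
    """Encode a list of s triples as a bit string of length 3 * s * ceil(log2(3s))."""
    ell = index_length(len(L))
    for triple in L:
        for v in triple:
            if not 0 <= v < 2 ** ell:
                raise ValueError(f"index {v} does not fit in {ell} bits")
    # Each index is written in binary, padded with leading zeroes to ell bits.
    return "".join(format(v, f"0{ell}b") for triple in L for v in triple)

def decode_list(bits, s):
    """Invert encode_list for a program that is known to have s lines."""
    ell = index_length(s)
    if len(bits) != 3 * s * ell:
        raise ValueError("wrong length for a size-s program")
    nums = [int(bits[k:k + ell], 2) for k in range(0, len(bits), ell)]
    return tuple(tuple(nums[k:k + 3]) for k in range(0, len(nums), 3))

L = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4))
bits = encode_list(L)
print(len(bits))                  # 48, since s = 4 and ceil(log2(12)) = 4
print(decode_list(bits, 4) == L)  # True
```

Because every index occupies exactly the same number of bits, decoding needs
no separators: knowing s is enough to cut the string back into triples.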


We can represent P = (n, m, L) as a string by prepending a prefix
free representation of n and m to the list L. Since n, m ≤ 3s (a program
must touch at least once all its input and output variables), those
prefix free representations can be encoded using strings of length
O(log s). In particular, every program P of at most s lines can be
represented by a string of length O(s log s). Similarly, every circuit C of
at most s gates can be represented by a string of length O(s log s) (for
example by translating C to the equivalent program P).

4 If you're curious what these few lines are, see our
GitHub repository.


5.4 A NAND-CIRC INTERPRETER IN NAND-CIRC 


Since we can represent programs as strings, we can also think of a
program as an input to a function. In particular, for all natural
numbers s, n, m > 0 we define the function
EVAL_{s,n,m} : {0,1}^{S(s)+n} → {0,1}^m as follows:

EVAL_{s,n,m}(px) = P(x)  if p ∈ {0,1}^{S(s)} represents a size-s program P with n inputs and m outputs
                   0^m   otherwise
(5.2)
where S(s) is defined as in (5.1) and we use the concrete representation
scheme described in Section 5.1.

That is, EVAL_{s,n,m} takes as input the concatenation of two strings:
a string p ∈ {0,1}^{S(s)} and a string x ∈ {0,1}^n. If p is a string that
represents a list of triples L such that (n, m, L) is a list-of-tuples
representation of a size-s NAND-CIRC program P, then EVAL_{s,n,m}(px)
is equal to the evaluation P(x) of the program P on the input x.
Otherwise, EVAL_{s,n,m}(px) equals 0^m (this case is not very important: you
can simply think of 0^m as some “junk value” that indicates an error).


Take-away points. The fine details of EVAL_{s,n,m}'s definition are not
very crucial. Rather, what you need to remember about EVAL_{s,n,m} is
that:

• EVAL_{s,n,m} is a finite function taking a string of fixed length as input
and outputting a string of fixed length as output.

• EVAL_{s,n,m} is a single function, such that computing EVAL_{s,n,m}
allows us to evaluate arbitrary NAND-CIRC programs of a certain
length on arbitrary inputs of the appropriate length.

• EVAL_{s,n,m} is a function, not a program (recall the discussion in
Section 3.7.2). That is, EVAL_{s,n,m} is a specification of what output is
associated with what input. The existence of a program that computes
EVAL_{s,n,m} (i.e., an implementation for EVAL_{s,n,m}) is a separate
fact, which needs to be established (and which we will do in
Theorem 5.9, with a more efficient program shown in Theorem 5.10).


One of the first examples of self circularity we will see in this book is 
the following theorem, which we can think of as showing a “NAND- 
CIRC interpreter in NAND-CIRC”: 


Theorem 5.9 — Bounded Universality of NAND-CIRC programs. For every
s, n, m ∈ ℕ with s ≥ m there is a NAND-CIRC program U_{s,n,m} that
computes the function EVAL_{s,n,m}.


That is, the NAND-CIRC program U_{s,n,m} takes the description
of any other NAND-CIRC program P (of the right length and
inputs/outputs) and any input x, and computes the result of evaluating the
program P on the input x. Given the equivalence between NAND-CIRC
programs and Boolean circuits, we can also think of U_{s,n,m} as
a circuit that takes as input the description of other circuits and their
inputs, and returns their evaluation, see Fig. 5.6. We call this
NAND-CIRC program U_{s,n,m} that computes EVAL_{s,n,m} a bounded universal
program (or a universal circuit, see Fig. 5.6). “Universal” stands for
the fact that this is a single program that can evaluate arbitrary code,
where “bounded” stands for the fact that U_{s,n,m} only evaluates
programs of bounded size. Of course this limitation is inherent for the
NAND-CIRC programming language, since a program of s lines (or,
equivalently, a circuit of s gates) can take at most 2s inputs. Later, in
Chapter 7, we will introduce the concept of loops (and the model of
Turing machines), that allow us to escape this limitation.


Proof. Theorem 5.9 is an important result, but it is actually not hard to
prove. Specifically, since EVAL_{s,n,m} is a finite function, Theorem 5.9 is
an immediate corollary of Theorem 4.12, which states that every finite
function can be computed by some NAND-CIRC program.


5.4.1 Efficient universal programs 
Theorem 5.9 establishes the existence of a NAND-CIRC program
for computing EVAL_{s,n,m} but it provides no explicit bound on the
size of this program. Theorem 4.12, which we used to prove
Theorem 5.9, guarantees the existence of a NAND-CIRC program whose
size can be as large as exponential in the length of its input. This would
mean that even for moderately small values of s, n, m (for example
n = 100, s = 300, m = 1), computing EVAL_{s,n,m} might require a
NAND-CIRC program with more lines than there are atoms in the
observable universe! Fortunately, we can do much better than that. In fact,
for every s, n, m there exists a NAND-CIRC program for computing
EVAL_{s,n,m} with size that is polynomial in its input length. This is
shown in the following theorem.

Figure 5.6: A universal circuit U is a circuit that gets as
input the description of an arbitrary (smaller) circuit
P as a binary string, and an input x, and outputs the
string P(x) which is the evaluation of P on x. We can
also think of U as a straight-line program that gets as
input the code of a straight-line program P and an
input x, and outputs P(x).


Theorem 5.10 — Efficient bounded universality of NAND-CIRC programs.
For every s, n, m ∈ ℕ there is a NAND-CIRC program of at
most O(s² log s) lines that computes the function
EVAL_{s,n,m} : {0,1}^{S(s)+n} → {0,1}^m defined above (where S(s) is the
number of bits needed to represent programs of s lines).
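
To get a feel for the gap between an exponential and a polynomial bound,
the following sketch plugs in the example values n = 100, s = 300. It only
compares orders of growth: the hidden constants in both bounds are ignored,
the figure of 10^80 atoms in the observable universe is a commonly quoted
rough estimate assumed here, and the names S, compare and ATOMS are
illustrative.

```python
def S(s):
    # Length in bits of the encoding of an s-line program, as in (5.1);
    # (3*s - 1).bit_length() equals ceil(log2(3*s)) for s >= 1.
    return 3 * s * (3 * s - 1).bit_length()

def compare(s, n):
    input_len = S(s) + n                      # input length of EVAL_{s,n,m}
    exponential = 2 ** input_len              # exponential in the input length
    efficient = s * s * (s - 1).bit_length()  # s^2 * ceil(log2 s), constants ignored
    return input_len, exponential, efficient

ATOMS = 10 ** 80  # assumed rough estimate, not taken from the text

input_len, exponential, efficient = compare(300, 100)
print(input_len)            # 9100
print(exponential > ATOMS)  # True
print(efficient)            # 810000
```

So for these parameters the polynomial bound is under a million (up to
constants), while the exponential one has thousands of decimal digits.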


Unlike Theorem 5.9, Theorem 5.10 is not a trivial corollary of the
fact that every finite function can be computed by some circuit.
Proving Theorem 5.10 requires us to present a concrete NAND-CIRC
program for computing the function EVAL_{s,n,m}. We will do so in several
stages.

1. First, we will describe the algorithm to evaluate EVAL_{s,n,m} in
“pseudo code”.

2. Then, we will show how we can write a program to compute
EVAL_{s,n,m} in Python. We will not use much about Python, and
a reader that has familiarity with programming in any language
should be able to follow along.

3. Finally, we will show how we can transform this Python program
into a NAND-CIRC program.


This approach yields much more than just proving Theorem 5.10: 
we will see that it is in fact always possible to transform (loop free) 
code in high level languages such as Python to NAND-CIRC pro- 
grams (and hence to Boolean circuits as well). 


5.4.2 A NAND-CIRC interpreter in “pseudocode”

To prove Theorem 5.10 it suffices to give a NAND-CIRC program of
O(s² log s) lines that can evaluate NAND-CIRC programs of s lines.
Let us start by thinking how we would evaluate such programs if we
weren't restricted to only performing NAND operations. That is, let us
describe informally an algorithm that on input n, m, s, a list of triples
L, and a string x ∈ {0,1}^n, evaluates the program represented by
(n, m, L) on the string x.


We will now describe such an algorithm. We assume that we have
access to a bit array data structure that can store for every i ∈ [t] a
bit T_i ∈ {0,1}. Specifically, if Table is a variable holding this data
structure, then we assume we can perform the operations:

• GET(Table,i) which retrieves the bit corresponding to i in Table.
The value of i is assumed to be an integer in [t].

• Table = UPDATE(Table,i,b) which updates Table so the bit
corresponding to i is now set to b. The value of i is assumed to be an
integer in [t] and b is a bit in {0,1}.

Algorithm 5.11 evaluates the program given to it as input one line
at a time, updating the Vartable table to contain the value of each
variable. At the end of the execution it outputs the variables at
positions t − m, t − m + 1, ..., t − 1 which correspond to the output variables.


5.4.3 A NAND interpreter in Python 
To make things more concrete, let us see how we implement Algo- 
rithm 5.11 in the Python programming language. (There is nothing 
special about Python. We could have easily presented a corresponding 
function in JavaScript, C, OCaml, or any other programming lan- 
guage.) We will construct a function NANDEVAL that on input n, m, L, x 
will output the result of evaluating the program represented by 
(n,m, L) on x. To keep things simple, we will not worry about the case 
that L does not represent a valid program of n inputs and m outputs. 
The code is presented in Fig. 5.7. 

Accessing an element of the array Vartable at a given index takes
a constant number of basic operations. Hence (since n, m ≤ s and
t ≤ 3s), the program above will use O(s) basic operations.⁵


5.4.4 Constructing the NAND-CIRC interpreter in NAND-CIRC 

We now turn to describing the proof of Theorem 5.10. To prove the
theorem it is not enough to give a Python program. Rather, we need to
show how we compute the function EVAL_{s,n,m} using a NAND-CIRC
program. In other words, our job is to transform, for every s, n, m, the
Python code of Section 5.4.3 to a NAND-CIRC program U_{s,n,m} that
computes the function EVAL_{s,n,m}.

Our construction will follow very closely the Python implementation
of EVAL above. We will use variables Vartable[0], ..., Vartable[2^ℓ − 1],
where ℓ = ⌈log 3s⌉, to store our variables. However, NAND doesn’t
have integer-valued variables, so we cannot write code such as
Vartable[i] for some variable i. However, we can implement the
function GET(Vartable,i) that outputs the i-th bit of the array
Vartable. Indeed, this is nothing but the function LOOKUP_ℓ that we
have seen in Theorem 4.10!


5 Python does not distinguish between lists and
arrays, but allows constant time random access to the
indexed elements of both of them. One could argue
that if we allowed programs of truly unbounded
length (e.g., larger than 2^64) then the price would
not be constant but logarithmic in the length of the
array/list, but the difference between O(s) and
O(s log s) will not be important for our discussions.
Figure 5.7: Code for evaluating a NAND-CIRC program given in the list-of-tuples representation

def NAND(a,b): return 1 - a*b # NAND of two bits

def NANDEVAL(n,m,L,X):
    # Evaluate a NAND-CIRC program from list of tuple representation.
    s = len(L) # num of lines
    t = max(max(a,b,c) for (a,b,c) in L)+1 # max index in L + 1
    Vartable = [0] * t # initialize array

    # helper functions
    def GET(V,i): return V[i]
    def UPDATE(V,i,b):
        V[i]=b
        return V

    # load input values to Vartable:
    for i in range(n):
        Vartable = UPDATE(Vartable,i,X[i])

    # Run the program
    for (i,j,k) in L:
        a = GET(Vartable,j)
        b = GET(Vartable,k)
        c = NAND(a,b)
        Vartable = UPDATE(Vartable,i,c)

    # Return outputs Vartable[t-m], Vartable[t-m+1],....,Vartable[t-1]
    return [GET(Vartable,t-m+j) for j in range(m)]

# Test on XOR (2 inputs, 1 output)
L = ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4))
print(NANDEVAL(2,1,L,(0,1))) # XOR(0,1)
# [1]
print(NANDEVAL(2,1,L,(1,1))) # XOR(1,1)
# [0]
We saw that we can compute LOOKUP_ℓ in time O(2^ℓ) = O(s) for
our choice of ℓ.

For every ℓ, let UPDATE_ℓ : {0,1}^{2^ℓ+ℓ+1} → {0,1}^{2^ℓ} correspond to the
UPDATE function for arrays of length 2^ℓ. That is, on input V ∈ {0,1}^{2^ℓ},
i ∈ {0,1}^ℓ, b ∈ {0,1}, UPDATE_ℓ(V, i, b) is equal to V′ ∈ {0,1}^{2^ℓ} such
that

V′_j = V_j  if j ≠ i
       b    if j = i

where we identify the string i ∈ {0,1}^ℓ with a number in {0, ..., 2^ℓ − 1}
using the binary representation. We can compute UPDATE_ℓ using an
O(2^ℓ · ℓ) = O(s log s) line NAND-CIRC program as follows:

1. For every j ∈ [2^ℓ], there is an O(ℓ) line NAND-CIRC program to
compute the function EQUALS_j : {0,1}^ℓ → {0,1} that on input i
outputs 1 if and only if i is equal to (the binary representation of) j.
(We leave verifying this as Exercise 5.2 and Exercise 5.3.)

2. We have seen that we can compute the function IF : {0,1}^3 → {0,1}
such that IF(a, b, c) equals b if a = 1 and c if a = 0.

Together, this means that we can compute UPDATE_ℓ (using some
“syntactic sugar” for bounded length loops) as follows:


def UPDATE_ell(V,i,b):
    # Get V[0]...V[2^ell-1], i in {0,1}^ell, b in {0,1}
    # Return NewV[0],...,NewV[2^ell-1]:
    # updated array with NewV[i]=b and all
    # else same as V
    for j in range(2**ell): # j = 0,1,2,....,2^ell -1
        a = EQUALS_j(i)
        NewV[j] = IF(a,b,V[j])
    return NewV


Since the loop over j in UPDATE is run 2^ℓ times, and computing
EQUALS_j takes O(ℓ) lines, the total number of lines to compute
UPDATE is O(2^ℓ · ℓ) = O(s log s). Once we can compute GET and UPDATE,
the rest of the implementation amounts to “book keeping” that needs
to be done carefully, but is not too insightful, and hence we omit the
full details. Since we run GET and UPDATE s times, the total number of
lines for computing EVAL_{s,n,m} is O(s²) + O(s² log s) = O(s² log s).
This completes (up to the omitted details) the proof of Theorem 5.10.


5.5 A PYTHON INTERPRETER IN NAND-CIRC (DISCUSSION) 


To prove Theorem 5.10 we essentially translated every line of the
Python program for EVAL into an equivalent NAND-CIRC snippet.
However, none of our reasoning was specific to the particular
function EVAL. It is possible to translate every Python program
into an equivalent NAND-CIRC program of comparable efficiency.
(More concretely, if the Python program takes T(n) operations on
inputs of length at most n then there exists a NAND-CIRC program of
O(T(n) log T(n)) lines that agrees with the Python program on inputs
of length n.) Actually doing so requires taking care of many details
and is beyond the scope of this book, but let me try to convince you
why you should believe it is possible in principle.

For starters, one can use CPython (the reference implementation 
for Python), to evaluate every Python program using a C program. We 
can combine this with a C compiler to transform a Python program 
to various flavors of “machine language”. So, to transform a Python 
program into an equivalent NAND-CIRC program, it is enough to 
show how to transform a machine language program into an equiva- 
lent NAND-CIRC program. One minimalistic (and hence convenient) 
family of machine languages is known as the ARM architecture which 
powers many mobile devices including essentially all Android
devices.⁶ There are even simpler machine languages, such as the LEG
architecture for which a backend for the LLVM compiler was imple- 
mented (and hence can be the target of compiling any of the large 
and growing list of languages that this compiler supports). Other ex- 
amples include the TinyRAM architecture (motivated by interactive 
proof systems that we will discuss in Chapter 22) and the teaching- 
oriented Ridiculously Simple Computer architecture. Going one by 
one over the instruction sets of such computers and translating them 
to NAND snippets is no fun, but it is a feasible thing to do. In fact, 
ultimately this is very similar to the transformation that takes place 
in converting our high level code to actual silicon gates that are not 
so different from the operations of a NAND-CIRC program. Indeed, 
tools such as MyHDL that transform “Python to Silicon” can be used 
to convert a Python program to a NAND-CIRC program. 

The NAND-CIRC programming language is just a teaching tool, 
and by no means do I suggest that writing NAND-CIRC programs, or 
compilers to NAND-CIRC, is a practical, useful, or enjoyable activity. 
What I do want is to make sure you understand why it can be done, 
and to have the confidence that if your life (or at least your grade) 
depended on it, then you would be able to do this. Understanding 
how programs in high level languages such as Python are eventually 
transformed into concrete low-level representation such as NAND is 
fundamental to computer science. 

The astute reader might notice that the above paragraphs only 
outlined why it should be possible to find for every particular Python- 
computable function f, a particular comparably efficient NAND-CIRC 
program P that computes f. But this still seems to fall short of our 
goal of writing a “Python interpreter in NAND” which would mean
that for every parameter s, we come up with a single NAND-CIRC
program UNIV_s such that given a description of a Python program
P, a particular input x, and a bound T on the number of operations
(where the lengths of P and x and the value of T are all at most s)
returns the result of executing P on x for at most T steps. After all,
the transformation above takes every Python program into a different 
NAND-CIRC program, and so does not yield “one NAND-CIRC pro- 
gram to rule them all” that can evaluate every Python program up to 
some given complexity. However, we can in fact obtain one NAND- 
CIRC program to evaluate arbitrary Python programs. The reason is 
that there exists a Python interpreter in Python: a Python program U 
that takes a bit string, interprets it as Python code, and then runs that 
code. Hence, we only need to show a NAND-CIRC program U* that 
computes the same function as the particular Python program U, and 
this will give us a way to evaluate all Python programs. 
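
The idea of a Python interpreter in Python can be illustrated with a very
small sketch that relies on the built-in exec. The conventions here are
assumptions made for illustration only: the bit string is read as 8-bit
ASCII, the interpreted program is expected to define a function called
main, and no bound T on the number of steps is enforced.

```python
def U(code_bits, x):
    # Decode the bit string into source text, 8 bits per ASCII character.
    source = bytes(
        int(code_bits[k:k + 8], 2) for k in range(0, len(code_bits), 8)
    ).decode("ascii")
    namespace = {}
    exec(source, namespace)       # run the decoded program
    return namespace["main"](x)   # hypothetical entry point named "main"

def to_bits(source):
    # Inverse of the decoding step in U.
    return "".join(format(byte, "08b") for byte in source.encode("ascii"))

program = "def main(x):\n    return [1 - (a & b) for a, b in zip(x, x[1:])]\n"
print(U(to_bits(program), [1, 1, 0]))   # [0, 1]
```

Here U is a single fixed program, while the program it runs arrives as
data, which is exactly the situation the paragraph above describes.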


6 ARM stands for “Advanced RISC Machine” where
RISC in turn stands for “Reduced instruction set
computer”.
What we are seeing time and again is the notion of universality or
self reference of computation, which is the sense that all reasonably rich 
models of computation are expressive enough that they can “simulate 
themselves”. The importance of this phenomenon to both the theory 
and practice of computing, as well as far beyond it, including the 
foundations of mathematics and basic questions in science, cannot be 
overstated. 


5.6 THE PHYSICAL EXTENDED CHURCH-TURING THESIS (DISCUSSION)


We've seen that NAND gates (and other Boolean operations) can be 
implemented using very different systems in the physical world. What 
about the reverse direction? Can NAND-CIRC programs simulate any 
physical computer? 

We can take a leap of faith and stipulate that Boolean circuits (or 
equivalently NAND-CIRC programs) do actually encapsulate every 
computation that we can think of. Such a statement (in the realm of 
infinite functions, which we'll encounter in Chapter 7) is typically 
attributed to Alonzo Church and Alan Turing, and in that context 
is known as the Church-Turing Thesis. As we will discuss in future 
lectures, the Church-Turing Thesis is not a mathematical theorem or
conjecture. Rather, like theories in physics, the Church-Turing Thesis 
is about mathematically modeling the real world. In the context of 
finite functions, we can make the following informal hypothesis or 
prediction: 


“Physical Extended Church-Turing Thesis” (PECTT): If a function
F : {0,1}^n → {0,1} can be computed in the physical world using s amount
of “physical resources” then it can be computed by a Boolean circuit of
roughly s gates.


A priori it might seem rather extreme to hypothesize that our meager
model of NAND-CIRC programs or Boolean circuits captures all
possible physical computation. Yet, in more than a century of
computing technologies, no one has built any scalable computing
device that challenges this hypothesis.

We now discuss the “fine print” of the PECTT in more detail, as 
well as the (so far unsuccessful) challenges that have been raised 
against it. There is no single universally-agreed-upon formalization 
of “roughly s physical resources”, but we can approximate this notion 
by considering the size of any physical computing device and the 
time it takes to compute the output, and ask that any such device can 
be simulated by a Boolean circuit with a number of gates that is a 
polynomial (with not too large exponent) in the size of the system and 
the time it takes it to operate. 
In other words, we can phrase the PECTT as stipulating that any
function that can be computed by a device that takes a certain volume 
V of space and requires t time to complete the computation, must be 
computable by a Boolean circuit with a number of gates p(V, t) that is 
polynomial in V and t. 

The exact form of the function p(V, t) is not universally agreed
upon but it is generally accepted that if f : {0,1}^n → {0,1} is an
exponentially hard function, in the sense that it has no NAND-CIRC
program of fewer than, say, 2^{n/2} lines, then a demonstration of a
physical device that can compute in the real world f for moderate input
lengths (e.g., n = 500) would be a violation of the PECTT.


Remark 5.13 — Advanced note: making PECTT concrete
(advanced, optional). We can attempt a more exact
phrasing of the PECTT as follows. Suppose that Z is
a physical system that accepts n binary stimuli and
has a binary output, and can be enclosed in a sphere
of volume V. We say that the system Z computes a
function f : {0,1}^n → {0,1} within t seconds if whenever
we set the stimuli to some value x ∈ {0,1}^n, if
we measure the output after t seconds then we obtain
f(x).

One can then phrase the PECTT as stipulating that if
there exists such a system Z that computes f within
t seconds, then there exists a NAND-CIRC program
that computes f and has at most α(Vt)² lines, where
α is some normalization constant. (We can also consider
variants where we use surface area instead
of volume, or take (Vt) to a different power than 2.
However, none of these choices makes a qualitative
difference to the discussion below.) In particular,
suppose that f : {0,1}^n → {0,1} is a function that
requires 2^n/(100n) > 2^{0.8n} lines for any NAND-CIRC
program (such a function exists by Theorem 5.3).
Then the PECTT would imply that either the volume
or the time of a system that computes f will have to
be at least 2^{0.2n}/√α. Since this quantity grows
exponentially in n, it is not hard to set parameters so that
even for moderately large values of n, such a system
could not fit in our universe.

To fully make the PECTT concrete, we need to decide
on the units for measuring time and volume, and the
normalization constant α. One conservative choice is
to assume that we could squeeze computation to the
absolute physical limits (which are many orders of
magnitude beyond current technology). This corresponds
to setting α = 1 and using the Planck units
for volume and time. The Planck length ℓ_P (which is,
roughly speaking, the shortest distance that can
theoretically be measured) is roughly 2^{−120} meters. The


5.6.1 Attempts at refuting the PECTT 
One of the admirable traits of mankind is the refusal to accept
limitations. In the best case this is manifested by people achieving
long-standing “impossible” challenges such as heavier-than-air flight,
putting a person on the moon, circumnavigating the globe, or even 
resolving Fermat’s Last Theorem. In the worst case it is manifested by 
people continually following the footsteps of previous failures to try to 
do proven-impossible tasks such as build a perpetual motion machine, 
trisect an angle with a compass and straightedge, or refute Bell’s in- 
equality. The Physical Extended Church-Turing thesis (in its various 
forms) has attracted both types of people. Here are some physical 
devices that have been speculated to achieve computational tasks that 
cannot be done by not-too-large NAND-CIRC programs: 


• Spaghetti sort: One of the first lower bounds that Computer Science
students encounter is that sorting n numbers requires making
Ω(n log n) comparisons. The “spaghetti sort” is a description of a
proposed “mechanical computer” that would do this faster. The
idea is that to sort n numbers x_1, ..., x_n, we could cut n spaghetti
noodles into lengths x_1, ..., x_n, and then if we simply hold them
together in our hand and bring them down to a flat surface, they
will emerge in sorted order. There are a great many reasons why
this is not truly a challenge to the PECTT hypothesis, and I will not
ruin the reader’s fun in finding them out by her or himself.


• Soap bubbles: One function F : {0,1}^n → {0,1} that is conjectured
to require a large number of NAND lines to solve is the Euclidean
Steiner Tree problem. This is the problem where one is given m
points in the plane (x_1, y_1), ..., (x_m, y_m) (say with integer
coordinates ranging from 1 to m, and hence the list can be represented
as a string of n = O(m log m) size) and some number K. The goal
is to figure out whether it is possible to connect all the points by
line segments of total length at most K. This function is conjectured
to be hard because it is NP complete (a concept that we'll
encounter later in this course), and it is in fact reasonable to conjecture
that as m grows, the number of NAND lines required to compute
this function grows exponentially in m, meaning that the PECTT
would predict that if m is sufficiently large (such as a few hundred
or so) then no physical device could compute F. Yet, some people
claimed that there is in fact a very simple physical device that could
solve this problem, that can be constructed using some wooden
pegs and soap. The idea is that if we take two glass plates, and put
m wooden pegs between them in the locations (x_1, y_1), ..., (x_m, y_m)
then bubbles will form whose edges touch those pegs in a way that
will minimize the total energy which turns out to be a function of
the total length of the line segments. The problem with this device
is that nature, just like people, often gets stuck in “local optima”.
That is, the resulting configuration will not be one that achieves
the absolute minimum of the total energy but rather one that can’t
be improved with local changes. Aaronson has carried out actual
experiments (see Fig. 5.8), and saw that while this device often
is successful for three or four pegs, it starts yielding suboptimal
results once the number of pegs grows beyond that.


• DNA computing. People have suggested using the properties of
DNA to do hard computational problems. The main advantage of
DNA is the ability to potentially encode a lot of information in a
relatively small physical space, as well as compute on this
information in a highly parallel manner. At the time of this writing, it
was demonstrated that one can use DNA to store about 10^16 bits
of information in a region of radius about a millimeter, as opposed
to about 10^10 bits with the best known hard disk technology. This
does not posit a real challenge to the PECTT but does suggest that
one should be conservative about the choice of constant and not
assume that current hard disk + silicon technologies are the absolute
best possible.⁷

Figure 5.8: Scott Aaronson tests a candidate device for
computing Steiner trees using soap bubbles.


• Continuous/real computers. The physical world is often described
using continuous quantities such as time and space, and people
have suggested that analog devices might have direct access to
computing with real-valued quantities and would be inherently
more powerful than discrete models such as NAND machines.
Whether the “true” physical world is continuous or discrete is an
open question. In fact, we do not even know how to precisely phrase
this question, let alone answer it. Yet, regardless of the answer, it
seems clear that the effort to measure a continuous quantity grows
with the level of accuracy desired, and so there is no “free lunch”
or way to bypass the PECTT using such machines (see also this
paper). Related to that are proposals known as “hypercomputing”
or “Zeno’s computers” which attempt to use the continuity of time
by doing the first operation in one second, the second one in half a
second, the third operation in a quarter second and so on. These
fail for a similar reason to the one guaranteeing that Achilles will
eventually catch the tortoise despite the original Zeno’s paradox.


• Relativity computer and time travel. The formulation above
assumed the notion of time, but under the theory of relativity time is
in the eye of the observer. One approach to solve hard problems is
to leave the computer to run for a lot of time from its perspective,
but to ensure that this is actually a short while from our perspective.
One approach to do so is for the user to start the computer and then
go for a quick jog at close to the speed of light before checking on
its status. Depending on how fast one goes, a few seconds from the
point of view of the user might correspond to centuries in computer
time (it might even finish updating its Windows operating
system!). Of course the catch here is that the energy required from
the user is proportional to how close one needs to get to the speed
of light. A more interesting proposal is to use time travel via closed
timelike curves (CTCs). In this case we could run an arbitrarily long
computation by doing some calculations, remembering the current
state, and then travelling back in time to continue where we left off.
Indeed, if CTCs exist then we’d probably have to revise the PECTT
(though in this case I will simply travel back in time and edit these
notes, so I can claim I never conjectured it in the first place...)


Humans. Another computing system that has been proposed as
a counterexample to the PECTT is a 3 pound computer of about
0.1m radius, namely the human brain. Humans can walk around,
talk, feel, and do other things that are not commonly done by
NAND-CIRC programs, but can they compute partial functions
that NAND-CIRC programs cannot? There are certainly computational
tasks that at the moment humans do better than computers
(e.g., play some video games), but based on our
current understanding of the brain, humans (or other animals)
have no inherent computational advantage over computers. The
brain has about $10^{11}$ neurons, each operating at a speed of about
1000 operations per second. Hence a rough first approximation is
that a Boolean circuit of about $10^{14}$ gates could simulate one second
of a brain's activity.⁸ Note that the fact that such a circuit (likely)
exists does not mean it is easy to find it. After all, constructing this
circuit took evolution billions of years. Much of the recent effort
in artificial intelligence research is focused on finding programs
that replicate some of the brain's capabilities. While they take massive
computational effort to discover, these programs often turn out to
be much smaller than the pessimistic estimates above. For example,
at the time of this writing, Google's neural network for machine
translation has about $10^4$ nodes (and can be simulated by a NAND-CIRC
program of comparable size). Philosophers, priests and many
others have since time immemorial argued that there is something
about humans that cannot be captured by mechanical devices such
as computers; whether or not that is the case, the evidence is thin
that humans can perform computational tasks that are inherently
impossible to achieve by computers of similar complexity.⁹

8 This is a very rough approximation that could be wrong to a few
orders of magnitude in either direction. For one, there are other
structures in the brain apart from neurons that one might need to
simulate, hence requiring higher overhead. On the other hand, it is
by no means clear that we need to fully clone the brain in order to
achieve the same computational tasks that it does.

9 There are some well known scientists that have advocated that
humans have inherent computational advantages over computers.
See also this.


Quantum computation. The most compelling attack on the Physical
Extended Church-Turing Thesis comes from the notion of quantum
computing. The idea was initiated by the observation that systems
with strong quantum effects are very hard to simulate on a 
computer. Turning this observation on its head, people have pro- 
posed using such systems to perform computations that we do not 
know how to do otherwise. At the time of this writing, scalable 
quantum computers have not yet been built, but it is a fascinating 
possibility, and one that does not seem to contradict any known law 
of nature. We will discuss quantum computing in much more detail 
in Chapter 23. Modeling quantum computation involves extending 
the model of Boolean circuits into Quantum circuits that have one 
more (very special) gate. However, the main takeaway is that while 
quantum computing does suggest we need to amend the PECTT, 
it does not require a complete revision of our worldview. Indeed, 
almost all of the content of this book remains the same regardless of 
whether the underlying computational model is Boolean circuits or 
quantum circuits. 


[Figure 5.9: schematic contrasting the “what” (specification: a function f) with the “how” (implementation: a Boolean or NAND circuit C, or an AON-CIRC or NAND-CIRC program P). Every function is computed by many circuits; every program/circuit computes one function.]

5.7 RECAP OF PART I: FINITE COMPUTATION 


This chapter concludes the first part of this book that deals with finite 
computation (computing functions that map a fixed number of Boolean 
inputs to a fixed number of Boolean outputs). The main take-aways 
from Chapter 3, Chapter 4, and Chapter 5 are as follows (see also 
Fig. 5.9): 


• We can formally define the notion of a function $f:\{0,1\}^n \rightarrow \{0,1\}^m$
being computable using $s$ basic operations. Whether these
operations are AND/OR/NOT, NAND, or some other universal
basis does not make much difference. We can describe such a
computation either using a circuit or using a straight-line program.


• We define $SIZE_{n,m}(s)$ to be the set of functions that are computable
by NAND circuits of at most $s$ gates. This set is equal to the set of
functions computable by a NAND-CIRC program of at most $s$ lines
and, up to a constant factor in $s$ (which we will not care about),
it is also the same as the set of functions that are computable
by a Boolean circuit of at most $s$ AND/OR/NOT gates. The class
$SIZE_{n,m}(s)$ is a set of functions, not of programs/circuits.


• Every function $f:\{0,1\}^n \rightarrow \{0,1\}^m$ can be computed using a
circuit of at most $O(m \cdot 2^n/n)$ gates. Some functions require at least
$\Omega(m \cdot 2^n/n)$ gates. We define $SIZE_{n,m}(s)$ to be the set of functions
from $\{0,1\}^n$ to $\{0,1\}^m$ that can be computed using at most $s$ gates.


• We can describe a circuit/program $P$ as a string. For every $s$, there
is a universal circuit/program $U_s$ that can evaluate programs of
length $s$ given their description as strings. We can use this
representation also to count the number of circuits of at most $s$ gates and
hence prove that some functions cannot be computed by circuits of
smaller-than-exponential size.


• If there is a circuit of $s$ gates that computes a function $f$, then we
can build a physical device to compute $f$ using $s$ basic components
(such as transistors). The “Physical Extended Church-Turing Thesis”
postulates that the reverse direction is true as well: if $f$ is a
function for which every circuit requires at least $s$ gates then
every physical device to compute $f$ will require about $s$ “physical
resources”. The main challenge to the PECTT is quantum computing,
which we will discuss in Chapter 23.


Figure 5.9: A finite computational task is specified by
a function $f:\{0,1\}^n \rightarrow \{0,1\}^m$. We can model
a computational process using Boolean circuits (of
varying gate sets) or straight-line programs. Every
function can be computed by many programs. We
say that $f \in SIZE_{n,m}(s)$ if there exists a NAND
circuit of at most $s$ gates (equivalently a NAND-CIRC
program of at most $s$ lines) that computes $f$. Every
function $f:\{0,1\}^n \rightarrow \{0,1\}^m$ can be computed by
a circuit of $O(m \cdot 2^n/n)$ gates. Many functions such
as multiplication, addition, solving linear equations,
computing the shortest path in a graph, and others,
can be computed by circuits of much fewer gates.
In particular there is an $O(s^2 \log s)$-size circuit
that computes the map $C,x \mapsto C(x)$ where $C$ is
a string describing a circuit of $s$ gates. However,
the counting argument shows there do exist some
functions $f:\{0,1\}^n \rightarrow \{0,1\}^m$ that require
$\Omega(m \cdot 2^n/n)$ gates to compute.
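To make the "program as a string" point of this recap concrete, here is a minimal Python sketch of an evaluator for the list-of-tuples representation. It assumes the convention that a program is a triple (n, m, lines), that variables 0..n-1 hold the inputs, that each line (target, in1, in2) assigns the NAND of two earlier variables, and that the outputs are the last m variables. The names `nand`, `eval_tuples`, and `XOR2` are illustrative and not taken from the text.

```python
def nand(a, b):
    """NAND of two bits."""
    return 1 - (a & b)


def eval_tuples(program, x):
    """Evaluate a NAND program given as (n, m, lines) on the input bits x."""
    n, m, lines = program
    assert len(x) == n, "wrong number of inputs"
    # Variables 0..n-1 hold the inputs (assumed convention).
    values = {i: bit for i, bit in enumerate(x)}
    for target, in1, in2 in lines:
        values[target] = nand(values[in1], values[in2])
    t = max(values) + 1  # total number of variables
    # Assumed convention: the outputs are the last m variables.
    return [values[t - m + j] for j in range(m)]


# Hypothetical example program: XOR of two bits built from four NAND lines.
XOR2 = (2, 1, ((2, 0, 1), (3, 0, 2), (4, 1, 2), (5, 3, 4)))
```

The evaluator treats the program purely as data: the same loop works for every program of every size, which is the role the universal circuit $U_s$ plays for programs of a fixed length.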


Sneak preview: In the next part we will discuss how to model computational
tasks on unbounded inputs, which are specified using functions
$F:\{0,1\}^* \rightarrow \{0,1\}^*$ (or $F:\{0,1\}^* \rightarrow \{0,1\}$) that can take an
unbounded number of Boolean inputs. 


5.8 EXERCISES 


Exercise 5.1 Which one of the following statements is false: 


a. There is an O(s?) line NAND-CIRC program that given as input 
program P of s lines in the list-of-tuples representation computes 
the output of P when all its input are equal to 1. 


b. There is an O(s*) line NAND-CIRC program that given as input 
program P of s characters encoded as a string of 7s bits using the 
ASCII encoding, computes the output of P when all its input are 
equal to 1. 


c. There is an $O(\sqrt{s})$ line NAND-CIRC program that given as input 
program P of s lines in the list-of-tuples representation computes 
the output of P when all its input are equal to 1. 


Exercise 5.2 — Equals function. For every $k \in \mathbb{N}$, show that there is an $O(k)$
line NAND-CIRC program that computes the function $EQUALS_k :
\{0,1\}^{2k} \rightarrow \{0,1\}$ where $EQUALS(x,x') = 1$ if and only if $x = x'$.


Exercise 5.3 — Equal to constant function. For every $k \in \mathbb{N}$ and $x' \in \{0,1\}^k$,
show that there is an $O(k)$ line NAND-CIRC program that computes
the function $EQUALS_{x'} : \{0,1\}^k \rightarrow \{0,1\}$ that on input $x \in \{0,1\}^k$
outputs 1 if and only if $x = x'$.


Exercise 5.4 — Counting lower bound for multibit functions. Prove that there
exists a number $\delta > 0$ such that for every $n, m$ there exists a function
$f:\{0,1\}^n \rightarrow \{0,1\}^m$ that requires at least $\delta m \cdot 2^n/n$ NAND gates to
compute. See footnote for hint.¹⁰


Exercise 5.5 — Size hierarchy theorem for multibit functions. Prove that there
exists a number $C$ such that for every $n, m$ and $n + m < s < m \cdot 2^n/(Cn)$
there exists a function $f \in SIZE_{n,m}(C \cdot s) \setminus SIZE_{n,m}(s)$. See footnote for
hint.¹¹


Exercise 5.6 — Efficient representation of circuits and a tighter counting upper
bound. Use the ideas of Remark 5.4 to show that for every $\epsilon > 0$ and
sufficiently large $s, n, m$,

$$|SIZE_{n,m}(s)| < 2^{(2+\epsilon) s \log s + n \log n + m \log s}$$

Conclude that the implicit constant in Theorem 5.2 can be made
arbitrarily close to $1/2$. See footnote for hint.¹²


Exercise 5.7 — Tighter counting lower bound. Prove that for every $\delta < 1/2$, if
$n$ is sufficiently large then there exists a function $f:\{0,1\}^n \rightarrow \{0,1\}$
such that $f \notin SIZE_{n,1}(\delta 2^n/n)$. See footnote for hint.¹³


Exercise 5.8 — Random functions are hard. Suppose $n > 1000$ and that we
choose a function $F:\{0,1\}^n \rightarrow \{0,1\}$ at random, choosing for every
$x \in \{0,1\}^n$ the value $F(x)$ to be the result of tossing an independent
unbiased coin. Prove that the probability that there is a $2^n/(1000n)$
line program that computes $F$ is at most $2^{-100}$.¹⁴


10 How many functions from $\{0,1\}^n$ to $\{0,1\}^m$ exist?

11 Follow the proof of Theorem 5.5, replacing the use
of the counting argument with Exercise 5.4.

12 Using the adjacency list representation, a graph
with $n$ in-degree zero vertices and $s$ in-degree two
vertices can be represented using roughly $2s \log(s + n) \leq 2s(\log s + O(1))$ bits.
The labeling of the $n$ input
and $m$ output vertices can be specified by a list of $n$
labels in $[n]$ and $m$ labels in $[m]$.

13 Hint: Use the results of Exercise 5.6 and the fact that
in this regime $m = 1$ and $n \ll s$.

14 Hint: An equivalent way to say this is that you
need to prove that the set of functions that can be
computed using at most $2^n/(1000n)$ lines has fewer
than $2^{-100} 2^{2^n}$ elements. Can you see why?


Exercise 5.9 The following is a tuple representing a NAND program:

(3, 1, ((3, 2, 2), (4, 1, 1), (5, 3, 4), (6, 2, 1), (7, 6, 6), (8, 0, 0), (9, 7, 8), (10, 5, 0), (11, 9, 10))).


1. Write a table with the eight values P(000), P(001), P(010), P(011), 
P(100), P(101), P(110), P(111) in this order. 


2. Describe what the program does in words. 


Exercise 5.10 — EVAL with XOR. For every sufficiently large $n$, let $E_n :
\{0,1\}^{n^2} \rightarrow \{0,1\}$ be the function that takes an $n^2$-length string that
encodes a pair $(P, x)$ where $x \in \{0,1\}^n$ and $P$ is a NAND program
of $n$ inputs, a single output, and at most $n^{1.1}$ lines, and returns the
output of $P$ on $x$.¹⁵ That is, $E_n(P,x) = P(x)$.

Prove that for every sufficiently large $n$, there does not exist an XOR
circuit $C$ that computes the function $E_n$, where an XOR circuit has the
XOR gate as well as the constants 0 and 1 (see Exercise 3.5). That is,
prove that there is some constant $n_0$ such that for every $n > n_0$ and
XOR circuit $C$ of $n^2$ inputs and a single output, there exists a pair
$(P, x)$ such that $C(P, x) \neq E_n(P, x)$.

15 Note that if $n$ is big enough, then it is easy to
represent such a pair using $n^2$ bits, since we can
represent the program using $O(n^{1.1} \log n)$ bits, and
we can always pad our representation to have exactly
$n^2$ length.


Exercise 5.11 — Learning circuits (challenge, optional, assumes more background).
(This exercise assumes background in probability theory and/or
machine learning that you might not have at this point. Feel free
to come back to it at a later point and in particular after going over
Chapter 18.) In this exercise we will use our bound on the number of
circuits of size $s$ to show that (if we ignore the cost of computation)
every such circuit can be learned from not too many training samples.
Specifically, if we find a size-$s$ circuit that classifies correctly a training
set of $O(s \log s)$ samples from some distribution $D$, then it is guaranteed
to do well on the whole distribution $D$. Since Boolean circuits
model very many physical processes (maybe even all of them, if the
(controversial) physical extended Church-Turing thesis is true), this
shows that all such processes could be learned as well (again, ignoring
the computation cost of finding a classifier that does well on the
training data).

Let $D$ be any probability distribution over $\{0,1\}^n$ and let $C$ be a
NAND circuit with $n$ inputs, one output, and size $s > n$. Prove that
there is some constant $c$ such that with probability at least 0.999 the
following holds: if $m = c s \log s$ and $x_0, \ldots, x_{m-1}$ are chosen independently
from $D$, then for every circuit $C'$ such that $C'(x_i) = C(x_i)$ on
every $i \in [m]$, $\Pr_{x \sim D}[C'(x) = C(x)] \geq 0.99$.

In other words, if $C'$ is a so called “empirical risk minimizer” that
agrees with $C$ on all the training examples $x_0, \ldots, x_{m-1}$, then it will
also agree with $C$ with high probability for samples drawn from the
distribution $D$ (i.e., it “generalizes”, to use Machine-Learning lingo).


See footnote for hint.¹⁶

16 Hint: Use our bound on the number of programs/circuits
of size $s$ (Theorem 5.2), as well as the
Chernoff Bound (Theorem 18.12) and the union
bound.


5.9 BIBLIOGRAPHICAL NOTES


The EVAL function is usually known as a universal circuit. The
implementation we describe is not the most efficient known. Valiant [Val76]
first showed a universal circuit of size $O(n \log n)$ where $n$ is the size of
the input. Universal circuits have seen in recent years new motivations
due to their applications for cryptography, see [LMS16; GKS17].

While we've seen that “most” functions mapping $n$ bits to one bit
require circuits of exponential size $\Omega(2^n/n)$, we actually do not know
of any explicit function for which we can prove that it requires, say, at
least $n^{100}$ or even $100n$ size. At the moment, the strongest such lower
bound we know is that there are quite simple and explicit $n$-variable
functions that require at least $(5 - o(1))n$ lines to compute, see this
paper of Iwama et al as well as this more recent work of Kulikov et al.
Proving lower bounds for restricted models of circuits is an extremely
interesting research area, for which Jukna's book [Juk12] (see also
Wegener [Weg87]) provides a very good introduction and overview. I
learned of the proof of the size hierarchy theorem (Theorem 5.5) from
Sasha Golovnev.

Scott Aaronson's blog post on how information is physical is a good
discussion on issues related to the physical extended Church-Turing
thesis. Aaronson's survey on NP complete problems and physical
reality [Aar05] discusses these issues as well, though it might be
easier to read after we reach Chapter 15 on NP and NP-completeness.


II 
UNIFORM COMPUTATION 


Learning Objectives: 


• Define functions on unbounded length inputs,
that cannot be described by a finite size table
of inputs and outputs.

• Equivalence with the task of deciding
membership in a language.

• Deterministic finite automatons (optional): A
simple example for a model for unbounded
computation.

• Equivalence with regular expressions.


Functions with Infinite domains, Automata, and Regular expressions


“An algorithm is a finite answer to an infinite number of questions.”, At- 
tributed to Stephen Kleene. 


The model of Boolean circuits (or equivalently, the NAND-CIRC 
programming language) has one very significant drawback: a Boolean 
circuit can only compute a finite function f. In particular, since every 
gate has two inputs, a size s circuit can compute on an input of length 
at most 2s. Thus this model does not capture our intuitive notion of an 
algorithm as a single recipe to compute a potentially infinite function. 
For example, the standard elementary school multiplication algorithm 
is a single algorithm that multiplies numbers of all lengths. However, 
we cannot express this algorithm as a single circuit, but rather need a 
different circuit (or equivalently, a NAND-CIRC program) for every 
input length (see Fig. 6.1). 

In this chapter, we extend our definition of computational tasks to 
consider functions with the unbounded domain of {0, 1}*. We focus 
on the question of defining what tasks to compute, mostly leaving 
the question of how to compute them to later chapters, where we will 
see Turing machines and other computational models for computing 
on unbounded inputs. However, we will see one example of a sim- 
ple restricted model of computation - deterministic finite automata 
(DFAs). 


Compiled on 12.19.2022 22:58 


[Figure 6.1: cartoon titled “World's most boring fourth grade class”, in which each day's lesson is multiplying numbers with one more digit, up to 365-digit numbers on day 365.]


Figure 6.1: Once you know how to multiply multi-digit
numbers, you can do so for every number $n$
of digits, but if you had to describe multiplication
using Boolean circuits or NAND-CIRC programs,
you would need a different program/circuit for every
length $n$ of the input.


6.1 FUNCTIONS WITH INPUTS OF UNBOUNDED LENGTH 


Up until now, we considered the computational task of mapping
some string of length $n$ into a string of length $m$. However, in general,
computational tasks can involve inputs of unbounded length.
For example, the following Python function computes the function
$XOR:\{0,1\}^* \rightarrow \{0,1\}$, where $XOR(x)$ equals 1 iff the number of 1's
in $x$ is odd. (In other words, $XOR(x) = \sum_{i=0}^{|x|-1} x_i \mod 2$ for every
$x \in \{0,1\}^*$.) As simple as it is, the XOR function cannot be computed
by a Boolean circuit. Rather, for every $n$, we can compute $XOR_n$
(the restriction of XOR to $\{0,1\}^n$) using a different circuit (e.g., see
Fig. 6.2).


def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result
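By contrast, in the circuit model each input length needs its own program. The following sketch generates, for a given n, the lines of a NAND-CIRC-style program for $XOR_n$ by chaining the four-line NAND construction for the XOR of two bits. The function name `xor_n_program` and the temporary variable names are illustrative assumptions, not the book's notation, and the sketch only handles n ≥ 2.

```python
def xor_n_program(n):
    """Return the lines of a NAND-CIRC-style program for XOR of n >= 2 bits."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    lines = []
    acc = "X[0]"  # name of the variable holding the XOR of the bits so far
    for i in range(1, n):
        u, v, w = (f"t{3 * (i - 1) + j}" for j in range(3))
        # The last two-bit XOR writes directly to the output variable.
        out = "Y[0]" if i == n - 1 else f"acc{i}"
        b = f"X[{i}]"
        # Four NAND lines computing out = acc XOR b.
        lines.append(f"{u} = NAND({acc},{b})")
        lines.append(f"{v} = NAND({acc},{u})")
        lines.append(f"{w} = NAND({b},{u})")
        lines.append(f"{out} = NAND({v},{w})")
        acc = out
    return lines
```

Calling `xor_n_program(5)` yields 16 lines, that is, four copies of the two-bit XOR gadget. The generator is a single finite object, while each program it emits works for only one input length.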


Previously in this book, we studied the computation of finite functions
$f:\{0,1\}^n \rightarrow \{0,1\}^m$. Such a function $f$ can always be described
by listing all the $2^n$ values it takes on inputs $x \in \{0,1\}^n$. In this chapter,
we consider functions such as XOR that take inputs of unbounded
size. While we can describe XOR using a finite number of symbols
(in fact, we just did so above), it takes infinitely many possible inputs,
and so we cannot just write down all of its values. The same is
true for many other functions capturing important computational
tasks, including addition, multiplication, sorting, finding paths in
graphs, fitting curves to points, and so on. To contrast with the finite
case, we will sometimes call a function $F:\{0,1\}^* \rightarrow \{0,1\}$ (or
$F:\{0,1\}^* \rightarrow \{0,1\}^*$) infinite. However, this does not mean that $F$
takes as input strings of infinite length! It just means that $F$ can take
as input a string that can be arbitrarily long, and so we cannot simply
write down a table of all the outputs of $F$ on different inputs.

Figure 6.2: The NAND circuit and NAND-CIRC
program for computing the XOR of 5 bits. Note how
the circuit for $XOR_5$ merely repeats four times the
circuit to compute the XOR of 2 bits.


As we have seen before, restricting attention to functions that use 
binary strings as inputs and outputs does not detract from our gener- 
ality, since other objects, including numbers, lists, matrices, images, 
videos, and more, can be encoded as binary strings. 

As before, it is essential to differentiate between specification and 
implementation. For example, consider the following function: 


$$TWINP(x) = \begin{cases} 1 & \exists_{p \in \mathbb{N}} \text{ s.t. } p, p+2 \text{ are primes and } p > |x| \\ 0 & \text{otherwise} \end{cases}$$


This is a mathematically well-defined function. For every $x$,
$TWINP(x)$ has a unique value which is either 0 or 1. However, at
the moment, no one knows of a Python program that computes this
function. The Twin prime conjecture posits that for every $n$ there
exists $p > n$ such that both $p$ and $p + 2$ are primes. If this conjecture
is true, then TWINP is easy to compute indeed: the program def TWINP(x):
return 1 will do the trick. However, mathematicians have tried
unsuccessfully to prove this conjecture since 1849. That said, whether
or not we know how to implement the function TWINP, the definition
above provides its specification. 
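To illustrate the gap between specification and implementation, here is a hedged sketch of the obvious search procedure. It is not a program that computes TWINP: it returns 1 when a suitable twin prime pair exists, but if none exists it runs forever and never outputs 0. The names `is_prime` and `twinp_search` are illustrative.

```python
def is_prime(k):
    """Trial-division primality test (slow, but enough for a sketch)."""
    if k < 2:
        return False
    d = 2
    while d * d <= k:
        if k % d == 0:
            return False
        d += 1
    return True


def twinp_search(x):
    """Return 1 once primes p, p+2 with p > len(x) are found.

    If no such pair exists this loop never terminates, so the
    procedure agrees with TWINP only on inputs where TWINP(x) = 1.
    """
    p = len(x) + 1
    while True:
        if is_prime(p) and is_prime(p + 2):
            return 1
        p += 1
```

For short inputs the search stops quickly, for example on a list of ten bits it finds the pair 11, 13. Whether it stops on every input is exactly the Twin prime conjecture.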


6.1.1 Varying inputs and outputs 
Many of the functions that interest us take more than one input. For 
example, the function 


$$MULT(x,y) = x \cdot y$$


takes the binary representation of a pair of integers $x, y \in \mathbb{N}$, and
outputs the binary representation of their product $x \cdot y$. However, since
we can represent a pair of strings as a single string, we will consider 
functions such as MULT as mapping {0, 1}* to {0, 1}*. We will typi- 
cally not be concerned with low-level details such as the precise way 
to represent a pair of integers as a string, since virtually all choices will 
be equivalent for our purposes.

Another example of a function we want to compute is


$$PALINDROME(x) = \begin{cases} 1 & \forall_{i \in [|x|]} \; x_i = x_{|x|-1-i} \\ 0 & \text{otherwise} \end{cases}$$


PALINDROME has a single bit as output. Functions with a single 
bit of output are known as Boolean functions. Boolean functions are 
central to the theory of computation, and we will discuss them often 
in this book. Note that even though Boolean functions have a single 
bit of output, their input can be of arbitrary length. Thus they are still 
infinite functions that cannot be described via a finite table of values. 
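As a small illustration, PALINDROME has a direct Python implementation that works for inputs of every length. This is a sketch that takes the input as a list of bits with zero-based indexing, so position i is compared with position len(x) - 1 - i.

```python
def PALINDROME(x):
    """Return 1 iff the bit list x reads the same forwards and backwards."""
    n = len(x)
    return 1 if all(x[i] == x[n - 1 - i] for i in range(n)) else 0
```

The single output bit does not bound the input: the same function accepts lists of any length, which is why no finite table of values describes it.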

“Booleanizing” functions. Sometimes it might be convenient to ob- 
tain a Boolean variant for a non-Boolean function. For example, the 
following is a Boolean variant of MULT. 


$$BMULT(x,y,i) = \begin{cases} i^{th} \text{ bit of } x \cdot y & i < |x \cdot y| \\ 0 & \text{otherwise} \end{cases}$$


If we can compute BMULT via any programming language such as 
Python, C, Java, etc., we can compute MULT as well, and vice versa. 
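A minimal sketch of both directions, under assumptions the text leaves open: numbers are Python integers rather than bit strings, bit 0 is the least significant bit, and $|x \cdot y|$ is the bit length of the product. The helper name `MULT_from_BMULT` is illustrative.

```python
def MULT(x, y):
    """Product of two non-negative integers."""
    return x * y


def BMULT(x, y, i):
    """i-th bit of x*y (bit 0 = least significant, an assumed convention)."""
    prod = MULT(x, y)
    return (prod >> i) & 1 if i < prod.bit_length() else 0


def MULT_from_BMULT(x, y):
    """Recover x*y using only calls to BMULT."""
    # The product of an a-bit and a b-bit number has at most a + b bits.
    n = x.bit_length() + y.bit_length()
    return sum(BMULT(x, y, i) << i for i in range(n))
```

The second direction relies on knowing an upper bound on the output length in advance; for a general function that bound is not available, which is what the extra input b in the next solved exercise takes care of.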


Solved Exercise 6.1 — Booleanizing general functions. Show that for every
function $F:\{0,1\}^* \rightarrow \{0,1\}^*$, there exists a Boolean function $BF :
\{0,1\}^* \rightarrow \{0,1\}$ such that a Python program to compute $BF$ can be
transformed into a program to compute $F$ and vice versa.


Solution:
For every $F:\{0,1\}^* \rightarrow \{0,1\}^*$, we can define


$$BF(x,i,b) = \begin{cases} F(x)_i & i < |F(x)|, b = 0 \\ 1 & i < |F(x)|, b = 1 \\ 0 & i \geq |F(x)| \end{cases}$$


to be the function that on input $x \in \{0,1\}^*$, $i \in \mathbb{N}$, $b \in \{0,1\}$
outputs the $i^{th}$ bit of $F(x)$ if $b = 0$ and $i < |F(x)|$. If $b = 1$, then
$BF(x,i,b)$ outputs 1 iff $i < |F(x)|$ and hence this allows us to compute
the length of $F(x)$.

Computing $BF$ from $F$ is straightforward. For the other direction,
given a Python function BF that computes $BF$, we can compute
$F$ as follows:


def F(x):
    res = []
    i = 0
    while BF(x,i,1):
        res.append(BF(x,i,0))
        i += 1
    return res
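The "straightforward" direction can be sketched as well. The wrappers below are illustrative (the names `make_BF` and `recover_F` do not appear in the text): `make_BF` builds BF from any Python function F that returns a list of bits, and `recover_F` repeats the loop above so that the round trip can be checked.

```python
def make_BF(F):
    """Build the Boolean variant BF(x, i, b) from a function F on bit lists."""
    def BF(x, i, b):
        out = F(x)
        if i >= len(out):
            return 0          # past the end of F(x), for either value of b
        return 1 if b == 1 else out[i]
    return BF


def recover_F(BF):
    """Rebuild F from BF by querying lengths (b=1) and bits (b=0)."""
    def F(x):
        res = []
        i = 0
        while BF(x, i, 1):
            res.append(BF(x, i, 0))
            i += 1
        return res
    return F
```

For any F that halts on every input, `recover_F(make_BF(F))` returns the same outputs as F, at the cost of recomputing F(x) on every query.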


6.1.2 Formal Languages 

For every Boolean function $F:\{0,1\}^* \rightarrow \{0,1\}$, we can define the set
$L_F = \{x \mid F(x) = 1\}$ of strings on which $F$ outputs 1. Such sets are
known as languages. This name is rooted in formal language theory as
pursued by linguists such as Noam Chomsky. A formal language is a
subset $L \subseteq \{0,1\}^*$ (or more generally $L \subseteq \Sigma^*$ for some finite alphabet
$\Sigma$). The membership or decision problem for a language $L$ is the task of
determining, given $x \in \{0,1\}^*$, whether or not $x \in L$. If we can compute
the function $F$, then we can decide membership in the language
$L_F$ and vice versa. Hence, many texts such as [Sip97] refer to the task 
of computing a Boolean function as “deciding a language”. In this 
book, we mostly describe computational tasks using the function nota- 
tion, which is easier to generalize to computation with more than one 
bit of output. However, since the language terminology is so popular 
in the literature, we will sometimes mention it. 


6.1.3 Restrictions of functions 

If $F:\{0,1\}^* \rightarrow \{0,1\}$ is a Boolean function and $n \in \mathbb{N}$ then the
restriction of $F$ to inputs of length $n$, denoted as $F_n$, is the finite function
$f:\{0,1\}^n \rightarrow \{0,1\}$ such that $f(x) = F(x)$ for every $x \in \{0,1\}^n$. That
is, $F_n$ is the finite function that is only defined on inputs in $\{0,1\}^n$, but
agrees with $F$ on those inputs. Since $F_n$ is a finite function, it can be
computed by a Boolean circuit, implying the following theorem: 


Theorem 6.1 — Circuit collection for every infinite function. Let $F:\{0,1\}^* \rightarrow
\{0,1\}$. Then there is a collection $\{C_n\}_{n \in \{1,2,\ldots\}}$ of circuits such that
for every $n > 0$, $C_n$ computes the restriction $F_n$ of $F$ to inputs of
length $n$.


Proof. This is an immediate corollary of the universality of Boolean
circuits. Indeed, since $F_n$ maps $\{0,1\}^n$ to $\{0,1\}$, Theorem 4.15 implies
that there exists a Boolean circuit $C_n$ to compute it. In fact, the size of
this circuit is at most $c \cdot 2^n/n$ gates for some constant $c < 10$.


In particular, Theorem 6.1 implies that there exists such a circuit
collection $\{C_n\}$ even for the TWINP function we described before,
even though we do not know of any program to compute it. Indeed,
this is not that surprising: for every particular $n \in \mathbb{N}$, $TWINP_n$ is either
the constant zero function or the constant one function, both of which
can be computed by very simple Boolean circuits. Hence a collection
of circuits $\{C_n\}$ that computes TWINP certainly exists. The difficulty
in computing TWINP using Python or any other programming language
arises from the fact that we do not know for each particular $n$
what is the circuit $C_n$ in this collection. 
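The reason each restriction $F_n$ has a circuit is that it is a finite object: a table with $2^n$ entries. The following sketch tabulates $F_n$ for a function we do know how to compute (XOR is used here as the example; `restriction_table` is an illustrative name).

```python
from itertools import product


def XOR(X):
    """Parity of a list of bits."""
    return sum(X) % 2


def restriction_table(F, n):
    """Truth table of F_n: maps each n-bit tuple x to F(x)."""
    return {x: F(list(x)) for x in product((0, 1), repeat=n)}


table = restriction_table(XOR, 3)  # 2**3 = 8 entries
```

For TWINP the table for each n exists just as well (it is constant), but this code cannot produce it, since it needs a Python implementation of F to fill in the entries.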


6.2 DETERMINISTIC FINITE AUTOMATA (OPTIONAL) 


All our computational models so far - Boolean circuits and straight- 
line programs - were only applicable for finite functions. 

In Chapter 7, we will present Turing machines, which are the central 
models of computation for unbounded input length functions. How- 
ever, in this section we present the more basic model of deterministic 
finite automata (DFA). Automata can serve as a good stepping-stone for 
Turing machines, though they will not be used much in later parts of 
this book, and so the reader can feel free to skip ahead to Chapter 7. 
DFAs turn out to be equivalent in power to regular expressions: a pow- 
erful mechanism to specify patterns, which is widely used in practice. 
Our treatment of automata is relatively brief. There are plenty of re- 
sources that help you get more comfortable with DFAs. In particular, 
Chapter 1 of Sipser’s book [Sip97] contains an excellent exposition of 
this material. There are also many websites with online simulators for 
automata, as well as translators from regular expressions to automata 
and vice versa (see for example here and here). 

At a high level, an algorithm is a recipe for computing an output 
from an input via a combination of the following steps: 


1. Read a bit from the input 
2. Update the state (working memory) 
3. Stop and produce an output 


For example, recall the Python program that computes the XOR 
function: 


def XOR(X):
    '''Takes list X of 0's and 1's
       Outputs 1 if the number of 1's is odd and outputs 0 otherwise'''
    result = 0
    for i in range(len(X)):
        result = (result + X[i]) % 2
    return result


In each step, this program reads a single bit X[i] and updates its
state result based on that bit (flipping result if X[i] is 1 and keeping
it the same otherwise). When it is done traversing the input,
the program outputs result. In computer science, such a program is
called a single-pass constant-memory algorithm since it makes a single
pass over the input and its working memory is finite. (Indeed, in this
case, result can either be 0 or 1.) Such an algorithm is also known as
a Deterministic Finite Automaton or DFA (another name for DFAs is a
finite state machine). We can think of such an algorithm as a “machine”
that can be in one of $C$ states, for some constant $C$. The machine starts
in some initial state and then reads its input $x \in \{0,1\}^*$ one bit at a
time. Whenever the machine reads a bit $\sigma \in \{0,1\}$, it transitions into a
new state based on $\sigma$ and its prior state. The output of the machine is
based on the final state. Every single-pass constant-memory algorithm
corresponds to such a machine. If an algorithm uses $c$ bits of memory,
then the contents of its memory can be represented as a string
of length $c$. Therefore such an algorithm can be in one of at most $2^c$
states at any point in the execution. 

We can specify a DFA of $C$ states by a list of $C \cdot 2$ rules. Each rule
will be of the form “If the DFA is in state $v$ and the bit read from the
input is $\sigma$ then the new state is $v'$”. At the end of the computation,
we will also have a rule of the form “If the final state is one of the
following ... then output 1, otherwise output 0”. For example, the
Python program above can be represented by a two-state automaton
for computing XOR of the following form:


• Initialize in the state 0.

• For every state $s \in \{0,1\}$ and input bit $\sigma$ read, if $\sigma = 1$ then change
to state $1 - s$, otherwise stay in state $s$.

• At the end output 1 iff $s = 1$.


We can also describe a C-state DFA as a labeled graph of C vertices. 

For every state s and bit ø, we add a directed edge labeled with o 

between s and the state s’ such that if the DFA is at state s and reads o 

then it transitions to s’. (If the state stays the same then this edge will 

be a self-loop; similarly, if s transitions to s’ in both the case ø = 0 and 

c = 1 then the graph will contain two parallel edges.) We also label 

the set S of states on which the automaton will output 1 at the end of 

the computation. This set is known as the set of accepting states. See 0 0 

Fig. 6.3 for the graphical representation of the XOR automaton. QO 1 C) 
Formally, a DFA is specified by (1) the table of the C - 2 rules, which 

can be represented as a transition function T that maps a state s € [C] 

and bit o € {0,1} to the state s’ € [C] which the DFA will transition to 

from state s on input c and (2) the set S of accepting states. This leads w 

to the following definition. 1 


Figure 6.3: A deterministic finite automaton that
computes the XOR function. It has two states 0 and 1,
and when it observes σ it transitions from v to v ⊕ σ.

Definition 6.2 — Deterministic Finite Automaton. A deterministic finite
automaton (DFA) with C states over {0,1} is a pair (T, S) with
T : [C] × {0,1} → [C] and S ⊆ [C]. The finite function T is known
as the transition function of the DFA. The set S is known as the set of
accepting states.

Let F : {0,1}* → {0,1} be a Boolean function with the infinite
domain {0,1}*. We say that (T, S) computes a function F : {0,1}* →
{0,1} if for every n ∈ ℕ and x ∈ {0,1}^n, if we define s_0 = 0 and
s_{i+1} = T(s_i, x_i) for every i ∈ [n], then

$$s_n \in S \;\Leftrightarrow\; F(x) = 1$$
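Definition 6.2 can be read as a recipe for a single-pass program. The following is a minimal Python sketch, assuming the transition function is stored as a dictionary keyed by (state, bit) pairs; the names run_dfa, T_xor and S_xor are illustrative and do not come from the text.

```python
def run_dfa(T, S, x):
    """Return 1 iff the DFA (T, S) accepts the bit sequence x.

    T: dict mapping (state, bit) -> state (an assumed encoding of the
       transition function); S: set of accepting states.
    """
    s = 0                    # s_0 = 0: the DFA always starts in state 0
    for bit in x:            # a single pass over the input
        s = T[(s, bit)]      # s_{i+1} = T(s_i, x_i)
    return 1 if s in S else 0


# The two-state XOR automaton: reading a 1 flips the state, a 0 keeps it.
T_xor = {(s, b): s ^ b for s in (0, 1) for b in (0, 1)}
S_xor = {1}
```

For instance, run_dfa(T_xor, S_xor, [1, 0, 1, 1]) evaluates to 1, since the input contains an odd number of ones.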


Solved Exercise 6.2 — DFA for (010)*. Prove that there is a DFA that
computes the following function F:

$$F(x) = \begin{cases} 1 & 3 \text{ divides } |x| \text{ and } \forall_{i \in [|x|/3]}\; x_{3i}\, x_{3i+1}\, x_{3i+2} = 010 \\ 0 & \text{otherwise} \end{cases}$$


Solution: 
When asked to construct a deterministic finite automaton, it is 
often useful to start by constructing a single-pass constant-memory
algorithm using a more general formalism (for example, using
pseudocode or a Python program). Once we have such an algo- 
rithm, we can mechanically translate it into a DFA. Here is a simple 
Python program for computing F: 


    def F(X):
        '''Return 1 iff X is a concatenation of zero/more
        copies of [0,1,0]'''
        if len(X) % 3 != 0:
            return False
        ultimate = 0
        penultimate = 1
        antepenultimate = 0
        for idx, b in enumerate(X):
            antepenultimate = penultimate
            penultimate = ultimate
            ultimate = b
            if idx % 3 == 2 and ((antepenultimate,
                 penultimate, ultimate) != (0,1,0)):
                return False
        return True


Since we keep three Boolean variables, the working memory can
be in one of 2^3 = 8 configurations, and so the program above can
be directly translated into an 8 state DFA. While this is not needed 
to solve the question, by examining the resulting DFA, we can see 
that we can merge some states and obtain a 4 state automaton, de- 
scribed in Fig. 6.4. See also Fig. 6.5, which depicts the execution of 
this DFA on a particular input. 
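One possible four-state automaton for this exercise can be written down as a transition table and checked against a direct restatement of F. This is a sketch only: the state numbering below (0 for the start, 1 after reading 0, 2 after reading 01, 3 for a "dead" state) is an assumption and need not coincide with the labels used in Fig. 6.4, and the helper names are illustrative.

```python
from itertools import product

# Assumed numbering: 0 = start and accepting, 1 = read "0",
# 2 = read "01", 3 = dead state that is never left.
T = {
    (0, 0): 1, (0, 1): 3,
    (1, 0): 3, (1, 1): 2,
    (2, 0): 0, (2, 1): 3,
    (3, 0): 3, (3, 1): 3,
}
S = {0}


def run_dfa(T, S, x):
    """Single pass over x, starting from state 0."""
    s = 0
    for bit in x:
        s = T[(s, bit)]
    return 1 if s in S else 0


def F_reference(x):
    """Direct restatement of F: is x a concatenation of copies of 010?"""
    x = list(x)
    return 1 if len(x) % 3 == 0 and x == [0, 1, 0] * (len(x) // 3) else 0


def agree_up_to(n):
    """Check that the DFA and the reference agree on all inputs of length <= n."""
    return all(run_dfa(T, S, x) == F_reference(x)
               for k in range(n + 1)
               for x in product((0, 1), repeat=k))
```

Calling agree_up_to(n) for a small n is an exhaustive sanity check, not a proof; the proof is the argument that each state records exactly how much of the current copy of 010 has been read.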


6.2.1 Anatomy of an automaton (finite vs. unbounded) 

Now that we are considering computational tasks with unbounded 
input sizes, it is crucial to distinguish between the components of our 
algorithm that have fixed length and the components that grow with 
the input size. For the case of DFAs these are the following: 


Constant size components: Given a DFA A, the following quantities are 
fixed independent of the input size: 


• The number of states C in A.

• The transition function T (which has 2C inputs, and so can be
specified by a table of 2C rows, each entry in which is a number in [C]).

• The set S ⊆ [C] of accepting states. This set can be described by a
string in {0,1}^C specifying which states are in S and which are not.


Figure 6.4: A DFA that outputs 1 only on inputs
x ∈ {0,1}* that are a concatenation of zero or more
copies of 010. The state 0 is both the starting state
and the only accepting state. The table denotes the
transition function T, which maps the current state
and symbol read to the new state.

Together the above means that we can fully describe an automaton
using finitely many symbols. This is a property we require out of any
notion of “algorithm”: we should be able to write down a complete
specification of how it produces an output from an input.


Components of unbounded size: The following quantities relating to a 
DFA are not bounded by any constant. We stress that these are still 
finite for any given input. 


• The length of the input x ∈ {0,1}* that the DFA is provided. The
input length is always finite, but not a priori bounded.

• The number of steps that the DFA takes can grow with the length of
the input. Indeed, a DFA makes a single pass on the input and so it
takes precisely |x| steps on an input x ∈ {0,1}*.




6.2.2 DFA-computable functions 

We say that a function F : {0,1}* → {0,1} is DFA computable if there
exists some DFA that computes F. In Chapter 4 we saw that every 
finite function is computable by some Boolean circuit. Thus, at this 
point, you might expect that every infinite function is computable by 
some DFA. However, this is very much not the case. We will soon see 
some simple examples of infinite functions that are not computable by 
DFAs, but for starters, let us prove that such functions exist. 


Theorem 6.4 — DFA-computable functions are countable. Let DFACOMP be 
the set of all Boolean functions F : {0,1}* → {0,1} such that there
exists a DFA computing F. Then DFACOMP is countable. 


Proof Idea:

Every DFA can be described by a finite length string, which yields
an onto map from {0,1}* to DFACOMP: namely, the function that
maps a string describing an automaton A to the function that it
computes.

⋆

Figure 6.5: Execution of the DFA of Fig. 6.4. The
number of states and the transition function size are
bounded, but the input can be arbitrarily long. If
the DFA is at state s and observes the value σ then it
moves to the state T(s, σ). At the end of the execution
the DFA accepts iff the final state is in S.


Proof of Theorem 6.4. Every DFA can be described by a finite string,
representing the transition function T and the set of accepting states,
and every DFA A computes some function F : {0,1}* → {0,1}. Thus
we can define the following function StDC : {0,1}* → DFACOMP:

$$StDC(a) = \begin{cases} F & a \text{ represents automaton } A \text{ and } F \text{ is the function } A \text{ computes} \\ ONE & \text{otherwise} \end{cases}$$

where ONE : {0,1}* → {0,1} is the constant function that outputs
1 on all inputs (and is a member of DFACOMP). Since by definition,
every function F in DFACOMP is computable by some automaton,
StDC is an onto function from {0,1}* to DFACOMP, which means
that DFACOMP is countable (see Section 2.4.2).
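Countability can also be seen concretely: because a DFA with C states is just a finite table plus a finite set, all DFAs can be listed one after another. The following Python sketch (with the hypothetical helper names all_dfas and enumerate_dfas, and the same dictionary encoding of T assumed as an illustration) generates every DFA with C states and then chains these lists for C = 1, 2, 3, and so on.

```python
from itertools import product


def all_dfas(C):
    """Yield every DFA (T, S) with exactly C states over {0, 1}."""
    keys = [(s, b) for s in range(C) for b in (0, 1)]
    # Each of the 2C table entries is a state in [C] ...
    for targets in product(range(C), repeat=2 * C):
        T = dict(zip(keys, targets))
        # ... and each of the C states is either accepting or not.
        for mask in product((False, True), repeat=C):
            S = {s for s in range(C) if mask[s]}
            yield T, S


def enumerate_dfas():
    """Yield all DFAs, going through C = 1, 2, 3, ... in order."""
    C = 1
    while True:
        yield from all_dfas(C)
        C += 1
```

There are C^(2C) · 2^C automata with C states, so every DFA appears at some finite position of enumerate_dfas(); mapping each position to the function computed by the automaton found there gives an onto map from the natural numbers to DFACOMP.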


Since the set of all Boolean functions is uncountable, we get the 
following corollary: 


Theorem 6.5 — Existence of DFA-uncomputable functions. There exists a 
Boolean function F : {0,1}* → {0,1} that is not computable by any
DFA. 


Proof. If every Boolean function F is computable by some DFA, then 
DFACOMP equals the set ALL of all Boolean functions, but by Theo- 
rem 2.12, the latter set is uncountable, contradicting Theorem 6.4. 

■


6.3 REGULAR EXPRESSIONS 


Searching for a piece of text is a common task in computing. At its
heart, the search problem is quite simple. We have a collection X =
{x_0, …, x_k} of strings (e.g., files on a hard-drive, or student records in
a database), and the user wants to find out the subset of all the x ∈ X
that are matched by some pattern (e.g., all files whose names end with
the string .txt). In full generality, we can allow the user to specify the
pattern by specifying a (computable) function F : {0,1}* → {0,1},
where F(x) = 1 corresponds to the pattern matching x. That is, the
user provides a program P in a programming language such as Python,
and the system returns all x ∈ X such that P(x) = 1. For example,
one could search for all text files that contain the string important
document or perhaps (letting P correspond to a neural-network based
classifier) all images that contain a cat. However, we don’t want our
system to get into an infinite loop just trying to evaluate the program
P! For this reason, typical systems for searching files or databases do
not allow users to specify the patterns using full-fledged programming
languages. Rather, such systems use restricted computational models that
on the one hand are rich enough to capture many of the queries needed
in practice (e.g., all filenames ending with .txt, or all phone numbers
of the form (617) xxx-xxxx), but on the other hand are restricted
enough so that queries can be evaluated very efficiently on huge files
and in particular cannot result in an infinite loop.

One of the most popular such computational models is regular 
expressions. If you ever used an advanced text editor, a command-line 
shell, or have done any kind of manipulation of text files, then you 
have probably come across regular expressions. 

A regular expression over some alphabet Σ is obtained by combining
elements of Σ with the operation of concatenation, as well as |
(corresponding to or) and * (corresponding to repetition zero or
more times). (Common implementations of regular expressions in
programming languages and shells typically include some extra
operations on top of | and *, but these operations can be implemented as
“syntactic sugar” using the operators | and *.) For example, the
following regular expression over the alphabet {0,1} corresponds to the
set of all strings x ∈ {0,1}* where every digit is repeated at least twice:


(00(0*)|11(1*))* . 


The following regular expression over the alphabet {a,...,z,0,..., 9} 
corresponds to the set of all strings that consist of a sequence of one 
or more of the letters a-d followed by a sequence of one or more digits 
(without a leading zero): 


(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)* .    (6.1)
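Both example expressions use only concatenation, | and *, so they happen to be valid as written in the syntax of Python's re module. The following sketch tries them on a few strings; the names TWICE, LETTERS_DIGITS and matches are illustrative. Note that fullmatch is used so that the whole string, and not just a prefix, must be matched.

```python
import re

# The two example expressions from the text, copied verbatim.
TWICE = re.compile(r"(00(0*)|11(1*))*")
LETTERS_DIGITS = re.compile(
    r"(a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*"
)


def matches(pattern, s):
    """True iff the entire string s is matched by the compiled pattern."""
    return pattern.fullmatch(s) is not None
```

For example, matches(TWICE, "0011100") holds because the string splits into the blocks 00, 111 and 00, while matches(LETTERS_DIGITS, "abc012") fails because of the leading zero.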


Formally, regular expressions are defined by the following recursive 
definition: 


Definition 6.6 — Regular expression. A regular expression e over an
alphabet Σ is a string over Σ ∪ {(, ), |, *, ∅, ""} that has one of the
following forms:

1. e = σ where σ ∈ Σ

2. e = (e’|e”) where e’, e” are regular expressions.

3. e = (e’)(e”) where e’, e” are regular expressions. (We often
drop the parentheses when there is no danger of confusion and
so write this as e’ e”.)

4. e = (e’)* where e’ is a regular expression.

Finally we also allow the following “edge cases”: e = ∅ and
e = "". These are the regular expressions corresponding to accepting
no strings, and accepting only the empty string respectively.


We will drop parentheses when they can be inferred from the
context. We also use the convention that OR and concatenation are
left-associative, and we give highest precedence to *, then
concatenation, and then OR. Thus for example we write 00*|11 instead of
((0)(0*))|((1)(1)).

Every regular expression e corresponds to a function Φ_e : Σ* →
{0,1} where Φ_e(x) = 1 if x matches the regular expression. For
example, if e = (00|11)* then Φ_e(110011) = 1 but Φ_e(101) = 0 (can you see
why?).


Definition 6.7 — Matching a regular expression. Let e be a regular
expression over the alphabet Σ. The function Φ_e : Σ* → {0,1} is defined
as follows:

1. If e = σ then Φ_e(x) = 1 iff x = σ.

2. If e = (e’|e”) then Φ_e(x) = Φ_{e’}(x) ∨ Φ_{e”}(x) where ∨ is the OR
operator.

3. If e = (e’)(e”) then Φ_e(x) = 1 iff there is some x’, x” ∈ Σ* such
that x is the concatenation of x’ and x” and Φ_{e’}(x’) = Φ_{e”}(x”) = 1.

4. If e = (e’)* then Φ_e(x) = 1 iff there is some k ∈ ℕ and some
x_0, …, x_{k−1} ∈ Σ* such that x is the concatenation x_0 ⋯ x_{k−1} and
Φ_{e’}(x_i) = 1 for every i ∈ [k].

5. Finally, for the edge cases Φ_∅ is the constant zero function, and
Φ_{""} is the function that only outputs 1 on the empty string "".

We say that a regular expression e over Σ matches a string x ∈ Σ*
if Φ_e(x) = 1.


A Boolean function is called “regular” if it outputs 1 on precisely 
the set of strings that are matched by some regular expression. That is, 


Definition 6.8 — Regular functions / languages. Let Σ be a finite set and
F : Σ* → {0,1} be a Boolean function. We say that F is regular if
F = Φ_e for some regular expression e.

Similarly, for every formal language L ⊆ Σ*, we say that L is
regular if and only if there is a regular expression e such that x ∈ L iff
e matches x.


Example 6.9 — A regular function. Let Σ = {a, b, c, d, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and F : Σ* → {0,1} be the function such that F(x) outputs 1 iff
x consists of one or more of the letters a-d followed by a sequence
of one or more digits (without a leading zero). Then F is a regular
function, since F = Φ_e where

e = (a|b|c|d)(a|b|c|d)*(1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9)*

is the expression we saw in (6.1).

If we wanted to verify, for example, that Φ_e(abc12078) = 1,
we can do so by noticing that the expression (a|b|c|d) matches
the string a, (a|b|c|d)* matches bc, (1|2|3|4|5|6|7|8|9) matches the
string 1, and the expression (0|1|2|3|4|5|6|7|8|9)* matches the string
2078. Each one of those boils down to a simpler expression. For
example, the expression (a|b|c|d)* matches the string bc because both
of the one-character strings b and c are matched by the expression
a|b|c|d.


Regular expressions can be defined over any finite alphabet Σ, but
as usual, we will mostly focus our attention on the binary case, where
Σ = {0,1}. Most (if not all) of the theoretical and practical general
insights about regular expressions can be gleaned from studying the
binary case.


6.3.1 Algorithms for matching regular expressions 

Regular expressions would not be very useful for search if we could
not evaluate, given a regular expression e, whether a string x is
matched by e. Luckily, there is an algorithm to do so. Specifically,
there is an algorithm (think “Python program” though later we
will formalize the notion of algorithms using Turing machines) that
on input a regular expression e over the alphabet {0,1} and a string
x ∈ {0,1}*, outputs 1 iff e matches x (i.e., outputs Φ_e(x)).

Indeed, Definition 6.7 actually specifies a recursive algorithm for 
computing Φ_e. Specifically, each one of our operations (concatenation,
OR, and star) can be thought of as reducing the task of testing whether
an expression e matches a string x to testing whether some sub- 
expressions of e match substrings of x. Since these sub-expressions 
are always shorter than the original expression, this yields a recursive 
algorithm for checking if e matches x, which will eventually terminate 
at the base cases of the expressions that correspond to a single symbol 
or the empty string.

Algorithm 6.10 — Regular expression matching.

Input: Regular expression e over Σ, x ∈ Σ*
Output: Φ_e(x)

    procedure Match(e, x)
        if e = ∅ then return 0
        if x = "" then return MatchEmpty(e)
        if e ∈ Σ then return 1 iff x = e
        if e = (e’|e”) then return Match(e’, x) or Match(e”, x)
        if e = (e’)(e”) then
            for i ∈ [|x| + 1] do
                if Match(e’, x_0 ⋯ x_{i−1}) and Match(e”, x_i ⋯ x_{|x|−1})
                    then return 1
            end for
        end if
        if e = (e’)* then
            if e’ = "" then return Match("", x)
            for i ∈ [|x|] do
                if Match(e, x_0 ⋯ x_{i−1}) and Match(e’, x_i ⋯ x_{|x|−1})
                    then return 1
            end for
        end if
        return 0
    end procedure


We assume above that we have a procedure MatchEmpty that
on input a regular expression e outputs 1 if and only if e matches the 
empty string "". 
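To make the recursion of Algorithm 6.10 concrete, here is a minimal Python sketch. It assumes a hypothetical encoding of expressions as nested tuples (described in the comments) rather than as strings, so that no parsing is needed; the function names are illustrative. The concatenation case tries every split point, including the one where the second part is empty.

```python
# Hypothetical tuple encoding of regular expressions (an assumption):
#   ("sym", c)                      a single alphabet symbol c
#   ("or", e1, e2), ("cat", e1, e2) OR and concatenation
#   ("star", e1)                    zero or more repetitions
#   ("eps",)                        the expression "" (only the empty string)
#   ("empty",)                      the expression that matches nothing


def match_empty(e):
    """Return True iff e matches the empty string."""
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("sym", "empty"):
        return False
    if tag == "or":
        return match_empty(e[1]) or match_empty(e[2])
    return match_empty(e[1]) and match_empty(e[2])   # tag == "cat"


def match(e, x):
    """Recursive matcher following the case analysis of Algorithm 6.10."""
    tag = e[0]
    if tag == "empty":
        return False
    if x == "":
        return match_empty(e)
    if tag == "eps":
        return False
    if tag == "sym":
        return x == e[1]
    if tag == "or":
        return match(e[1], x) or match(e[2], x)
    if tag == "cat":
        # Try every way of splitting x into a prefix and a suffix.
        return any(match(e[1], x[:i]) and match(e[2], x[i:])
                   for i in range(len(x) + 1))
    # tag == "star" and x is non-empty: split off a non-empty last block.
    if e[1] == ("eps",):
        return False
    return any(match(e, x[:i]) and match(e[1], x[i:])
               for i in range(len(x)))
```

With e encoding (00|11)*, match(e, "110011") holds while match(e, "101") does not, in line with the example given after Definition 6.6. The sketch is deliberately naive and can take exponential time.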

The key observation is that in our recursive definition of regular ex- 
pressions, whenever e is made up of one or two expressions e’, e” then 
these two regular expressions are smaller than e. Eventually (when 
they have size 1) then they must correspond to the non-recursive 
case of a single alphabet symbol. Correspondingly, the recursive calls 
made in Algorithm 6.10 always correspond to a shorter expression or 
(in the case of an expression of the form (e’)*) a shorter input string. 
Thus, we can prove the correctness of Algorithm 6.10 on inputs of the 
form (e, x) by induction over min{|e|, |x|}. The base case is when
either x = "" or e is a single alphabet symbol, "" or ∅. In the case the
expression is of the form e = (e’|e”) or e = (e’)(e”), we make recursive
calls with the shorter expressions e’, e”. In the case the expression is of
the form e = (e’)*, we make recursive calls with either a shorter string
x and the same expression, or with the shorter expression e’ and a
string x’ that is equal in length or shorter than x.


Solved Exercise 6.3 — Match the empty string. Give an algorithm that on
input a regular expression e, outputs 1 if and only if Φ_e("") = 1.

Solution:
We can obtain such a recursive algorithm by using the following
observations:
1. An expression of the form "" or (e’)* always matches the empty
string.
2. An expression of the form σ, where σ ∈ Σ is an alphabet symbol,
never matches the empty string.
3. The regular expression ∅ does not match the empty string.
4. An expression of the form e’|e” matches the empty string if and
only if one of e’ or e” matches it.
5. An expression of the form (e’)(e”) matches the empty string if
and only if both e’ and e” match it.
Given the above observations, we see that the following
algorithm will check if e matches the empty string:
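The five observations translate directly into a short recursive procedure. The following is a minimal Python sketch of such a procedure, assuming a hypothetical encoding of expressions as nested tuples (listed in the comment); it is an illustration rather than a reproduction of the algorithm box of the original text.

```python
# Hypothetical tuple encoding (an assumption for illustration):
#   ("sym", c), ("or", e1, e2), ("cat", e1, e2), ("star", e1),
#   ("eps",) for the expression "" and ("empty",) for the empty set.


def match_empty(e):
    """Return 1 iff the expression e matches the empty string."""
    tag = e[0]
    if tag in ("eps", "star"):       # observation 1
        return 1
    if tag in ("sym", "empty"):      # observations 2 and 3
        return 0
    if tag == "or":                  # observation 4
        return 1 if match_empty(e[1]) or match_empty(e[2]) else 0
    # tag == "cat": observation 5
    return 1 if match_empty(e[1]) and match_empty(e[2]) else 0
```

Each recursive call is on a strictly smaller expression, so the running time depends only on the length of e.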


6.4 EFFICIENT MATCHING OF REGULAR EXPRESSIONS (OPTIONAL)


Algorithm 6.10 is not very efficient. For example, given an expression 
involving concatenation or the “star” operation and a string of length
n, it can make n recursive calls, and hence it can be shown that in the
worst case Algorithm 6.10 can take time exponential in the length of 
the input string x. Fortunately, it turns out that there is a much more 
efficient algorithm that can match regular expressions in linear (i.e., 
O(n)) time. Since we have not yet covered the topics of time and space 
complexity, we describe this algorithm in high level terms, without 
making the computational model precise. Rather we will use the 
colloquial notion of O(n) running time as used in introduction to 
programming courses and whiteboard coding interviews. We will see 
a formal definition of time complexity in Chapter 13. 


Theorem 6.12 — Matching regular expressions in linear time. Let e be a 
regular expression. Then there is an O(n) time algorithm that 
computes Φ_e.


The implicit constant in the O(n) term of Theorem 6.12 depends on
the expression e. Thus, another way to state Theorem 6.12 is that for
every expression e, there is some constant c and an algorithm A that
computes Φ_e on n-bit inputs using at most c · n steps. This makes sense
since in practice we often want to compute Φ_e(x) for a small regular
expression e and a large document x. Theorem 6.12 tells us that we
can do so with running time that scales linearly with the size of the
document, even if it has (potentially) worse dependence on the size of
the regular expression.

We prove Theorem 6.12 by obtaining a more efficient recursive
algorithm, that determines whether e matches a string x ∈ {0,1}^n by
reducing this task to determining whether a related expression e’
matches x_0, …, x_{n−2}. This will result in an expression for the running
time of the form T(n) = T(n − 1) + O(1) which solves to T(n) = O(n).


Restrictions of regular expressions. The central definition for the
algorithm behind Theorem 6.12 is the notion of a restriction of a regular
expression. The idea is that for every regular expression e and
symbol σ in its alphabet, it is possible to define a regular expression e[σ]
such that e[σ] matches a string x if and only if e matches the string xσ.
For example, if e is the regular expression (01)*(01) (i.e., one or more
occurrences of 01) then e[1] is equal to (01)*0 and e[0] will be ∅. (Can
you see why?)

Algorithm 6.13 computes the restriction e[σ] given a regular
expression e and an alphabet symbol σ. It always terminates, since the
recursive calls it makes are always on expressions smaller than the
input expression. Its correctness can be proven by induction on the
length of the regular expression e, with the base cases being when e is
"", ∅, or a single alphabet symbol τ.

Using this notion of restriction, we can define the following
recursive algorithm for regular expression matching:
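As an illustration of both ideas, the following Python sketch computes a restriction e[σ] and then matches a string by repeatedly stripping its last symbol. It is one possible realization, not necessarily identical to the algorithm boxes of the original text: the tuple encoding of expressions, the simplifying constructors cat and alt, and all function names are assumptions made for this example.

```python
EMPTY, EPS = ("empty",), ("eps",)   # the expressions for the empty set and ""


def match_empty(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("sym", "empty"):
        return False
    if tag == "or":
        return match_empty(e[1]) or match_empty(e[2])
    return match_empty(e[1]) and match_empty(e[2])    # "cat"


def cat(a, b):
    """Concatenation, simplified when one side is the empty set or ""."""
    if a == EMPTY or b == EMPTY:
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("cat", a, b)


def alt(a, b):
    """OR, simplified when one side is the empty set."""
    if a == EMPTY:
        return b
    if b == EMPTY or a == b:
        return a
    return ("or", a, b)


def restrict(e, s):
    """Return e[s], which matches x iff e matches x followed by s."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if e[1] == s else EMPTY
    if tag == "or":
        return alt(restrict(e[1], s), restrict(e[2], s))
    if tag == "cat":
        r = cat(e[1], restrict(e[2], s))      # s is consumed by the right part
        if match_empty(e[2]):                 # or the right part is empty
            r = alt(r, restrict(e[1], s))
        return r
    return cat(e, restrict(e[1], s))          # star: (e')* followed by e'[s]


def match(e, x):
    """Match by restricting to the last symbol and recursing on the rest."""
    if x == "":
        return match_empty(e)
    return match(restrict(e, x[-1]), x[:-1])
```

On the example from the text, with e encoding (01)*(01), restrict(e, "1") returns the encoding of (01)*0 and restrict(e, "0") returns the encoding of ∅.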


By the definition of a restriction, for every σ ∈ Σ and x’ ∈ Σ*,
the expression e matches x’σ if and only if e[σ] matches x’. Hence for
every e and x ∈ Σ^n, Φ_{e[x_{n−1}]}(x_0 ⋯ x_{n−2}) = Φ_e(x) and Algorithm 6.14
does return the correct answer. The only remaining task is to analyze
its running time. Note that Algorithm 6.14 uses the MatchEmpty
procedure of Solved Exercise 6.3 in the base case that x = "". However,
this is OK since this procedure’s running time depends only on e and
is independent of the length of the original input.

For simplicity, let us restrict our attention to the case that the
alphabet Σ is equal to {0,1}. Define C(ℓ) to be the maximum number
of operations that Algorithm 6.13 takes when given as input a regular
expression e over {0,1} of at most ℓ symbols. The value C(ℓ) can be
shown to be polynomial in ℓ, though this is not important for this
theorem, since we only care about the dependence of the time to compute
Φ_e(x) on the length of x and not about the dependence of this time on
the length of e.

Algorithm 6.14 is a recursive algorithm that on input an expression
e and a string x ∈ {0,1}^n, does computation of at most C(|e|) steps
and then calls itself with input some expression e’ and a string x’ of
length n − 1. It will terminate after n steps when it reaches a string of
length 0. So, the running time T(e, n) that it takes for Algorithm 6.14
to compute Φ_e for inputs of length n satisfies the recursive equation:


$$T(e,n) = \max\{\, T(e[0], n-1),\; T(e[1], n-1) \,\} + C(|e|) \tag{6.2}$$

(In the base case n = 0, T(e, 0) is equal to some constant depending
only on e.) To get some intuition for the expression Eq. (6.2), let us
open up the recursion for one level, writing T(e, n) as

$$\begin{aligned} T(e,n) = \max\{\, & T(e[0][0], n-2) + C(|e[0]|), \\ & T(e[0][1], n-2) + C(|e[0]|), \\ & T(e[1][0], n-2) + C(|e[1]|), \\ & T(e[1][1], n-2) + C(|e[1]|) \,\} + C(|e|) . \end{aligned}$$

Continuing this way, we can see that T(e, n) ≤ n · C(L) + O(1)
where L is the largest length of any expression e’ that we encounter
along the way. Therefore, the following claim suffices to show that
Algorithm 6.14 runs in O(n) time:


Claim: Let e be a regular expression over {0,1}, then there is a
number L(e) ∈ ℕ, such that for every sequence of symbols α_0, …, α_{n−1},
if we define e’ = e[α_0][α_1] ⋯ [α_{n−1}] (i.e., restricting e to α_0, and
then α_1 and so on and so forth), then |e’| ≤ L(e).


Proof of claim: For a regular expression e over {0,1} and α ∈ {0,1}^m,
we denote by e[α] the expression e[α_0][α_1] ⋯ [α_{m−1}] obtained by
restricting e to α_0 and then to α_1 and so on. We let S(e) = {e[α] | α ∈ {0,1}*}.
We will prove the claim by showing that for every e, the set S(e) is
finite, and hence so is the number L(e) which is the maximum length of
e’ for e’ ∈ S(e).

We prove this by induction on the structure of e. If e is a symbol, the
empty string, or the empty set, then this is straightforward to show
as the most expressions S(e) can contain are the expression itself, "",
and ∅. Otherwise we split to the two cases (i) e = e’* and (ii) e =
e’e”, where e’, e” are smaller expressions (and hence by the induction
hypothesis S(e’) and S(e”) are finite). In the case (i), if e = (e’)* then
e[α] is either equal to (e’)*e’[α] or it is simply the empty set if e’[α] = ∅.
Since e’[α] is in the set S(e’), the number of distinct expressions in
S(e) is at most |S(e’)| + 1. In the case (ii), if e = e’e” then all the
restrictions of e to strings α will either have the form e’e”[α] or the form
e’e”[α] | e’[α’] where α’ is some string such that α = α’α” and e”[α”]
matches the empty string. Since e”[α] ∈ S(e”) and e’[α’] ∈ S(e’), the
number of the possible distinct expressions of the form e[α] is at most
|S(e”)| + |S(e”)| · |S(e’)|. This completes the proof of the claim.


The bottom line is that while running Algorithm 6.14 on a regular
expression e, all the expressions we ever encounter are in the finite set
S(e), no matter how large the input x is, and so the running time of
Algorithm 6.14 satisfies the equation T(n) = T(n − 1) + C’ for some
constant C’ depending on e. This solves to O(n) where the implicit
constant in the O notation can (and will) depend on e but crucially,
not on the length of the input x.


6.4.1 Matching regular expressions using DFAs 

Theorem 6.12 is already quite impressive, but we can do even better.
Specifically, no matter how long the string x is, we can compute Φ_e(x)
by maintaining only a constant amount of memory and moreover
making a single pass over x. That is, the algorithm will scan the input
x once from start to finish, and then determine whether or not x is
matched by the expression e. This is important in the common case
of trying to match a short regular expression over a huge file or docu- 
ment that might not even fit in our computer’s memory. Of course, as 
we have seen before, a single-pass constant-memory algorithm is sim- 
ply a deterministic finite automaton. As we will see in Theorem 6.17, a 
function can be computed by regular expression if and only if it can be 
computed by a DFA. We start with showing the “only if” direction: 


Theorem 6.15 — DFA for regular expression matching. Let e be a regular
expression. Then there is an algorithm that on input x ∈ {0,1}*
computes Φ_e(x) while making a single pass over x and maintaining
a constant amount of memory.


Proof Idea:

The single-pass constant-memory algorithm for checking if a string
matches a regular expression is presented in Algorithm 6.16. The idea is
to replace the recursive algorithm of Algorithm 6.14 with a dynamic
program, using the technique of memoization. If you haven’t yet taken an
algorithms course, you might not know these techniques. This is OK;
while this more efficient algorithm is crucial for the many practical
applications of regular expressions, it is not of great importance for
this book.

⋆

Proof of Theorem 6.15. Algorithm 6.16 checks if a given string x ∈ Σ*
is matched by the regular expression e. For every regular expression
e, this algorithm has a constant number of Boolean variables
(specifically a variable v_{e’} for every e’ ∈ S(e) and a variable last_{e’} for every
e’ in S(e), using the fact that e’[x_i] is in S(e) for every e’ ∈ S(e)). It
makes a single pass over the input string. Hence it corresponds to a
DFA. We prove its correctness by induction on the length n of the
input. Specifically, we will argue that before reading x_i, the variable v_{e’}
is equal to Φ_{e’}(x_0 ⋯ x_{i−1}) for every e’ ∈ S(e). In the case i = 0 this
holds since we initialize v_{e’} = Φ_{e’}("") for all e’ ∈ S(e). For i > 0
this holds by induction since the inductive hypothesis implies that
last_{e’} = Φ_{e’}(x_0 ⋯ x_{i−2}) for all e’ ∈ S(e) and by the definition of the set
S(e’), for every e’ ∈ S(e) and x_{i−1} ∈ Σ, e” = e’[x_{i−1}] is in S(e) and
Φ_{e’}(x_0 ⋯ x_{i−1}) = Φ_{e”}(x_0 ⋯ x_{i−2}).


6.4.2 Equivalence of regular expressions and automata 

Recall that a Boolean function F : {0,1}* → {0,1} is defined to be
regular if it is equal to Φ_e for some regular expression e. (Equivalently,
a language L ⊆ {0,1}* is defined to be regular if there is a regular
expression e such that e matches x iff x ∈ L.) The following theorem is
the central result of automata theory:

Theorem 6.17 — DFA and regular expression equivalency. Let F : {0,1}* →
{0,1}. Then F is regular if and only if there exists a DFA (T, S) that
computes F.


Proof Idea:

One direction follows from Theorem 6.15, which shows that for
every regular expression e, the function Φ_e can be computed by a DFA
(see for example Fig. 6.6). For the other direction, we show that given
a DFA (T, S) for every v, w ∈ [C] we can find a regular expression that
would match x ∈ {0,1}* if and only if the DFA starting in state v, will
end up in state w after reading x.

⋆


Proof of Theorem 6.17. Since Theorem 6.15 proves the “only if”
direction, we only need to show the “if” direction. Let A = (T, S) be a DFA
with C states that computes the function F. We need to show that F is
regular.

For every v, w ∈ [C], we let F_{v,w} : {0,1}* → {0,1} be the function
that maps x ∈ {0,1}* to 1 if and only if the DFA A, starting at the
state v, will reach the state w if it reads the input x. We will prove that
F_{v,w} is regular for every v, w. This will prove the theorem, since by
Definition 6.2, F(x) is equal to the OR of F_{0,w}(x) for every w ∈ S.
Hence if we have a regular expression for every function of the form
F_{v,w} then (using the | operation), we can obtain a regular expression
for F as well.

To give regular expressions for the functions F_{v,w}, we start by
defining the following functions F^t_{v,w}: for every v, w ∈ [C] and
0 ≤ t ≤ C, F^t_{v,w}(x) = 1 if and only if starting from v and
observing x, the automaton reaches w with all intermediate states being in the set
[t] = {0, …, t − 1} (see Fig. 6.7). That is, while v, w themselves might
be outside [t], F^t_{v,w}(x) = 1 if and only if throughout the execution of
the automaton on the input x (when initiated at v) it never enters any
of the states outside [t] and still ends up at w. If t = 0 then [t] is the
empty set, and hence F^0_{v,w}(x) = 1 if and only if the automaton reaches
w from v directly on x, without any intermediate state. If t = C then
all states are in [t], and hence F^C_{v,w} = F_{v,w}.
We will prove the theorem by induction on t, showing that F^t_{v,w} is
regular for every v, w and t. For the base case of t = 0, F^0_{v,w} is regular
for every v, w since it can be described using the expressions "", ∅,
0, 1 and 0|1. Specifically, if v ≠ w then F^0_{v,w}(x) = 1 if and only if x
consists of a single symbol σ ∈ {0,1} and T(v, σ) = w. Therefore in this
case F^0_{v,w} corresponds to one of the four regular expressions 0|1, 0, 1
or ∅, depending on whether A transitions to w from v when it reads
either 0 or 1, only one of these symbols, or neither. If v = w then
F^0_{v,w}(x) = 1 if and only if x is the empty string or x is a single symbol
σ with T(v, σ) = v, and so F^0_{v,v} corresponds to the OR of "" with the
corresponding one of these four expressions.

Inductive step: Now that we’ve seen the base case, let us prove
the general case by induction. Assume, via the induction hypothesis,
that for every v’, w’ ∈ [C], we have a regular expression R^t_{v’,w’} that
computes F^t_{v’,w’}. We need to prove that F^{t+1}_{v,w} is regular for every v, w.

If the automaton arrives from v to w using the intermediate states
[t+1], then it visits the t-th state zero or more times. If the path labeled
by x causes the automaton to get from v to w without visiting the
t-th state at all, then x is matched by the regular expression R^t_{v,w}. If
the path labeled by x causes the automaton to get from v to w while
visiting the t-th state k > 0 times, then we can think of this path as:

• First travel from v to t using only intermediate states in [t].

• Then go from t back to itself k − 1 times using only intermediate
states in [t].

• Then go from t to w using only intermediate states in [t].

Therefore in this case the string x is matched by the regular
expression R^t_{v,t}(R^t_{t,t})^{k−1} R^t_{t,w}. (See also Fig. 6.8.)

Therefore we can compute F^{t+1}_{v,w} using the regular expression

$$R^t_{v,w} \;|\; R^t_{v,t}\,(R^t_{t,t})^{*}\,R^t_{t,w} .$$

This completes the proof of the inductive step and hence of the
theorem.

Figure 6.6: A deterministic finite automaton that
computes the function Φ_{(01)*}.

Figure 6.7: Given a DFA of C states, for every v, w ∈
[C] and number t ∈ {0, …, C} we define the function
F^t_{v,w} : {0,1}* → {0,1} to output one on input
x ∈ {0,1}* if and only if when the DFA is initialized
in the state v and is given the input x, it will reach the
state w while going only through the intermediate
states {0, …, t − 1}.
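The induction in this proof is constructive, and it can be run as a small dynamic program over t. The Python sketch below is an illustration under stated assumptions: expressions are encoded as nested tuples, the DFA is given as a dictionary T keyed by (state, bit), and the helper names (alt, cat, star, dfa_to_regex, match) are hypothetical. In the base case this sketch also records a single symbol on a self-loop when v = w. A naive memoized matcher is included only so that the result can be checked.

```python
from functools import lru_cache

EMPTY, EPS = ("empty",), ("eps",)   # expressions for the empty set and ""


def alt(a, b):
    if a == EMPTY:
        return b
    if b == EMPTY or a == b:
        return a
    return ("or", a, b)


def cat(a, b):
    if a == EMPTY or b == EMPTY:
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("cat", a, b)


def star(a):
    return EPS if a in (EMPTY, EPS) else ("star", a)


def dfa_to_regex(T, S, C):
    """Build an expression for the function computed by the DFA (T, S)."""
    # Base case t = 0: paths with no intermediate states at all.
    R = [[EPS if v == w else EMPTY for w in range(C)] for v in range(C)]
    for v in range(C):
        for b in (0, 1):
            w = T[(v, b)]
            R[v][w] = alt(R[v][w], ("sym", str(b)))
    # Inductive step: additionally allow state t as an intermediate state.
    for t in range(C):
        R = [[alt(R[v][w], cat(cat(R[v][t], star(R[t][t])), R[t][w]))
              for w in range(C)] for v in range(C)]
    e = EMPTY
    for w in sorted(S):              # OR over all accepting states
        e = alt(e, R[0][w])
    return e


@lru_cache(maxsize=None)
def match(e, x):
    """Naive matcher over the tuple encoding, used only for checking."""
    tag = e[0]
    if tag == "empty":
        return False
    if tag == "eps":
        return x == ""
    if tag == "sym":
        return x == e[1]
    if tag == "or":
        return match(e[1], x) or match(e[2], x)
    if tag == "cat":
        return any(match(e[1], x[:i]) and match(e[2], x[i:])
                   for i in range(len(x) + 1))
    return x == "" or any(match(e, x[:i]) and match(e[1], x[i:])
                          for i in range(len(x)))
```

Applied to the two-state XOR automaton, dfa_to_regex yields an expression that matches exactly the strings with an odd number of ones. The expressions produced this way can grow quickly with the number of states, which is acceptable here since only their existence matters for the theorem.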


6.4.3 Closure properties of regular expressions 

If F and G are regular functions computed by the expressions e and f
respectively, then the expression e|f computes the function H = F ∨ G
defined as H(x) = F(x) ∨ G(x). Another way to say this is that the set
of regular functions is closed under the OR operation. That is, if F and G
are regular then so is F ∨ G. An important corollary of Theorem 6.17
is that this set is also closed under the NOT operation:


Figure 6.8: If we have regular expressions $R^t_{v',w'}$
corresponding to $F^t_{v',w'}$ for every $v', w' \in [C]$, we can
obtain a regular expression $R^{t+1}_{v,w}$ corresponding to
$F^{t+1}_{v,w}$. The key observation is that a path from $v$ to $w$
using $\{0,\ldots,t\}$ either does not touch $t$ at all, in which
case it is captured by the expression $R^t_{v,w}$, or it goes
from $v$ to $t$, comes back to $t$ zero or more times, and
then goes from $t$ to $w$, in which case it is captured by
the expression $R^t_{v,t}(R^t_{t,t})^* R^t_{t,w}$.




Lemma 6.18 (Regular expressions closed under complement). If $F : \{0,1\}^* \to
\{0,1\}$ is regular then so is the function $\overline{F}$, where $\overline{F}(x) = 1 - F(x)$ for
every $x \in \{0,1\}^*$.


Proof. If $F$ is regular then by Theorem 6.12 it can be computed by a
DFA $A$. But we can then construct a DFA $\overline{A}$ which does the same
computation but flips the set of accepted states. The DFA $\overline{A}$ will compute
$\overline{F}$. By Theorem 6.17 this implies that $\overline{F}$ is regular as well.

■
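The "flip the accepting states" step of this proof is short enough to write out. The sketch below assumes a DFA over $\{0,1\}$ given as a dictionary `T` from `(state, symbol)` to state, with start state 0; the names `run_dfa` and `complement` are hypothetical helpers, not notation from the text.

```python
from itertools import product

def run_dfa(T, accepting, x, start=0):
    """Return 1 if the DFA ends in an accepting state on input x, else 0."""
    s = start
    for a in x:
        s = T[(s, a)]
    return 1 if s in accepting else 0

def complement(C, T, accepting):
    """Same transitions, but the set of accepting states is flipped."""
    return T, set(range(C)) - set(accepting)

# Example: a two-state DFA that outputs 1 iff the input ends with 1.
T_end = {(s, a): int(a) for s in (0, 1) for a in "01"}
T_not, acc_not = complement(2, T_end, {1})
```

On every input the complemented automaton follows exactly the same path through the states, so its output is $1 - F(x)$.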


Since $a \wedge b = \overline{\overline{a} \vee \overline{b}}$, Lemma 6.18 implies that the set of regular
functions is closed under the AND operation as well. Moreover, since
OR, NOT and AND are a universal basis, this set is also closed under
NAND, XOR, and any other finite function. That is, we have the
following corollary:


Theorem 6.19 (Closure of regular expressions). Let $f : \{0,1\}^k \to \{0,1\}$ be
any finite Boolean function, and let $F_0, \ldots, F_{k-1} : \{0,1\}^* \to \{0,1\}$ be
regular functions. Then the function $G(x) = f(F_0(x), F_1(x), \ldots, F_{k-1}(x))$
is regular.


Proof. This is a direct consequence of the closure of regular functions
under OR and NOT (and hence AND), combined with Theorem 4.13,
which states that every $f$ can be computed by a Boolean circuit (which is
simply a combination of the AND, OR, and NOT operations).

■


6.5 LIMITATIONS OF REGULAR EXPRESSIONS AND THE PUMPING 
LEMMA 


The efficiency of regular expression matching makes them very useful. 
This is why operating systems and text editors often restrict their 
search interface to regular expressions and do not allow searching by 
specifying an arbitrary function. However, this efficiency comes at
a cost. As we have seen, regular expressions cannot compute every
function. In fact, there are some very simple (and useful!) functions 
that they cannot compute. Here is one example: 


Lemma 6.20 (Matching parentheses). Let $\Sigma = \{(,)\}$ and $MATCHPAREN :
\Sigma^* \to \{0,1\}$ be the function that given a string of parentheses,
outputs 1 if and only if every opening parenthesis is matched by a
corresponding closed one. Then there is no regular expression over $\Sigma$ that
computes $MATCHPAREN$.


Lemma 6.20 is a consequence of the following result, which is
known as the pumping lemma:

Theorem 6.21 (Pumping Lemma). Let $e$ be a regular expression over
some alphabet $\Sigma$. Then there is some number $n_0$ such that for
every $w \in \Sigma^*$ with $|w| > n_0$ and $\Phi_e(w) = 1$, we can write $w = xyz$ for
strings $x,y,z \in \Sigma^*$ satisfying the following conditions:

1. $|y| \geq 1$.

2. $|xy| \leq n_0$.

3. $\Phi_e(xy^kz) = 1$ for every $k \in \mathbb{N}$.

Proof Idea:

The idea behind the proof is the following. Let $n_0$ be twice the
number of symbols that are used in the expression $e$. Then the only
way that there is some $w$ with $|w| > n_0$ and $\Phi_e(w) = 1$ is that $e$
contains the $*$ (i.e. star) operator and that there is a non-empty substring
$y$ of $w$ that was matched by $(e')^*$ for some sub-expression $e'$ of $e$. We
can now repeat $y$ any number of times and still get a matching string.
See also Fig. 6.9.
*


Figure 6.9: To prove the "pumping lemma" we look
at a word $w$ that is much larger than the regular
expression $e$ that matches it. In such a case, part of
$w$ must be matched by some sub-expression of the
form $(e')^*$, since this is the only operator that allows
matching words longer than the expression. If we
look at the "leftmost" such sub-expression and define
$y^k$ to be the string that is matched by it, we obtain the
partition needed for the pumping lemma.

Proof of Theorem 6.21. To prove the lemma formally, we use induction
on the length of the expression. Like all induction proofs, this will 
be somewhat lengthy, but at the end of the day it directly follows the 
intuition above that somewhere we must have used the star operation. 
Reading this proof, and in particular understanding how the formal 
proof below corresponds to the intuitive idea above, is a very good 
way to get more comfortable with inductive proofs of this form. 

Our inductive hypothesis is that for an $n$ length expression, $n_0 =
2n$ satisfies the conditions of the lemma. The base case is when the
expression is a single symbol $\sigma \in \Sigma$ or the expression is $\emptyset$ or
"". In all these cases the conditions of the lemma are satisfied simply
because $n_0 = 2$, and there exists no string $x$ of length larger than $n_0$
that is matched by the expression.

We now prove the inductive step. Let $e$ be a regular expression
with $n > 1$ symbols. We set $n_0 = 2n$ and let $w \in \Sigma^*$ be a string
satisfying $|w| > n_0$. Since $e$ has more than one symbol, it has one of
the forms (a) $e'|e''$, (b) $(e')(e'')$, or (c) $(e')^*$ where in all these cases
the subexpressions $e'$ and $e''$ have fewer symbols than $e$ and hence
satisfy the induction hypothesis.

In the case (a), every string $w$ matched by $e$ must be matched by
either $e'$ or $e''$. If $e'$ matches $w$ then, since $|w| > 2|e'|$, by the induction
hypothesis there exist $x, y, z$ with $|y| \geq 1$ and $|xy| \leq 2|e'| < n_0$ such
that $e'$ (and therefore also $e = e'|e''$) matches $xy^kz$ for every $k$. The
same argument works in the case that $e''$ matches $w$.

In the case (b), if $w$ is matched by $(e')(e'')$ then we can write $w =
w'w''$ where $e'$ matches $w'$ and $e''$ matches $w''$. We split into subcases. If
$|w'| > 2|e'|$ then by the induction hypothesis there exist $x, y, z'$ with
$|y| \geq 1$, $|xy| \leq 2|e'| < n_0$ such that $w' = xyz'$ and $e'$ matches $xy^kz'$
for every $k \in \mathbb{N}$. This completes the proof since if we set $z = z'w''$
then we see that $w = w'w'' = xyz$ and $e = (e')(e'')$ matches $xy^kz$ for
every $k \in \mathbb{N}$. Otherwise, if $|w'| \leq 2|e'|$ then since $|w| = |w'| + |w''| >
n_0 = 2(|e'| + |e''|)$, it must be that $|w''| > 2|e''|$. Hence by the induction
hypothesis there exist $x', y, z$ such that $|y| \geq 1$, $|x'y| \leq 2|e''|$ and $e''$
matches $x'y^kz$ for every $k \in \mathbb{N}$. But now if we set $x = w'x'$ we see that
$|xy| = |w'| + |x'y| \leq 2|e'| + 2|e''| = n_0$ and on the other hand the
expression $e = (e')(e'')$ matches $xy^kz = w'x'y^kz$ for every $k \in \mathbb{N}$.

In case (c), if $w$ is matched by $(e')^*$ then $w = w_0 \cdots w_t$ where for
every $i \in \{0,\ldots,t\}$, $w_i$ is a nonempty string matched by $e'$. If $|w_0| > 2|e'|$,
then we can use the same approach as in the concatenation case above.
Otherwise, we simply note that if $x$ is the empty string, $y = w_0$, and
$z = w_1 \cdots w_t$ then $|xy| \leq n_0$ and $xy^kz$ is matched by $(e')^*$ for every
$k \in \mathbb{N}$.
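To make the statement of the pumping lemma concrete, the following sketch searches by brute force for a partition $w = xyz$ satisfying its three conditions, using Python's `re.fullmatch` as a stand-in for $\Phi_e$. The function name `pumping_partition` and the cutoff `max_k` are assumptions made for illustration: the code only checks $k < \texttt{max\_k}$, so it gives finite evidence rather than a proof.

```python
import re

def pumping_partition(pattern, w, n0, max_k=6):
    """Return the first (x, y, z) with w = xyz, |y| >= 1, |xy| <= n0 such that
    x + y*k + z matches `pattern` for every k < max_k, or None if none exists."""
    for i in range(n0 + 1):                  # i = |x|
        for j in range(1, n0 - i + 1):       # j = |y|, so |xy| = i + j <= n0
            if i + j > len(w):
                break
            x, y, z = w[:i], w[i:i + j], w[i + j:]
            if all(re.fullmatch(pattern, x + y * k + z) for k in range(max_k)):
                return x, y, z
    return None

# Example: w = 01010101 is matched by (01)*; taking n0 = 6 as an assumed bound,
# the search finds that the leading "01" can be pumped.
partition = pumping_partition("(01)*", "01010101", 6)
```

Here the block $y$ is a string matched by the starred sub-expression, as in the proof idea.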




Using the pumping lemma, we can easily prove Lemma 6.20 (i.e., 
the non-regularity of the “matching parenthesis” function): 


Proof of Lemma 6.20. Suppose, towards a contradiction, that
there is an expression $e$ such that $\Phi_e = MATCHPAREN$. Let $n_0$ be
the number obtained from Theorem 6.21 and let $w = (^{n_0})^{n_0}$ (i.e., $n_0$
left parentheses followed by $n_0$ right parentheses). Then we see that
if we write $w = xyz$ as in Theorem 6.21, the condition $|xy| \leq n_0$
implies that $y$ consists solely of left parentheses. Hence the string
$xy^2z$ will contain more left parentheses than right parentheses. Hence
$MATCHPAREN(xy^2z) = 0$ but by the pumping lemma $\Phi_e(xy^2z) = 1$,
contradicting our assumption that $\Phi_e = MATCHPAREN$.

■
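The argument can be checked mechanically for small values of $n_0$. The sketch below assumes that $MATCHPAREN$ means "balanced": no prefix has more right than left parentheses, and the totals agree. The helper names `matchparen` and `admissible_partitions` are hypothetical.

```python
def matchparen(s):
    """1 if the parentheses in s are balanced, 0 otherwise (assumed definition)."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:                 # a right parenthesis with no partner
            return 0
    return 1 if depth == 0 else 0

def admissible_partitions(w, n0):
    """All ways to write w = xyz with |y| >= 1 and |xy| <= n0."""
    for i in range(n0 + 1):
        for j in range(1, n0 - i + 1):
            yield w[:i], w[i:i + j], w[i + j:]

def pumping_fails_everywhere(n0):
    """True if k = 2 breaks every admissible partition of w = (^n0 )^n0."""
    w = "(" * n0 + ")" * n0
    return all(matchparen(x + y * 2 + z) == 0
               for x, y, z in admissible_partitions(w, n0))
```

Whatever partition the pumping lemma supplies, $y$ lies inside the block of left parentheses, so doubling it always unbalances the string.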


The pumping lemma is a very useful tool to show that certain functions
are not computable by a regular expression. However, it is not an
"if and only if" condition for regularity: there are non-regular functions
that still satisfy the pumping lemma conditions. To understand
the pumping lemma, it is crucial to follow the order of quantifiers in
Theorem 6.21. In particular, the number $n_0$ in the statement of
Theorem 6.21 depends on the regular expression (in the proof we chose $n_0$
to be twice the number of symbols in the expression). So, if we want
to use the pumping lemma to rule out the existence of a regular
expression $e$ computing some function $F$, we need to be able to choose
an appropriate input $w \in \{0,1\}^*$ that can be arbitrarily large and
satisfies $F(w) = 1$. This makes sense if you think about the intuition
behind the pumping lemma: we need $w$ to be large enough to force
the use of the star operator.


Figure 6.10 (text of the cartoon):

Exercise: Let $F : \{0,1\}^* \to \{0,1\}$ be defined such that $F(x) = 1$ iff $x = 0^n1^n$ for $n \in \mathbb{N}$.
Prove that $F$ is not regular.

Blue Team: Student proving $F$ is not regular. Red Team: Hypothetical "adversary" claiming $F$ is regular.

Red: "$F$ is computed by a regular expression exp"

Blue: "Is that so? Then what is the number whose
existence is guaranteed by the pumping lemma?"

Red: "Here is the number; you can call it $n_0$"

Blue: "In this case, let me choose $w = 0^{n_0}1^{n_0}$. Notice
that $F(w) = 1$. What is the partition $w = xyz$ from
the pumping lemma?"

Red: "Since $|xy| \leq n_0$ and $|y| \geq 1$, I guess I am forced to use
$x = 0^a$, $y = 0^b$, $z = 0^{n_0-a-b}1^{n_0}$ for $b \geq 1$ and $a \leq n_0 - b$"

Blue: "In this case, since I can choose $k$ as I want, let me set $k = 2$
and note that $xy^kz = 0^{n_0+b}1^{n_0}$ which contradicts the pumping
lemma conclusion that $F(xy^kz) = 1$!"

Pumping Lemma: If exp computes $F$ there exists $n_0$ such that for every $w$ with $F(w) = 1$ and $|w| > n_0$ there
exists a partition $w = xyz$ with $|xy| \leq n_0$ and $|y| \geq 1$ such that for every $k \in \mathbb{N}$ it holds that $F(xy^kz) = 1$.

Figure 6.10: A cartoon of a proof using the pumping lemma that a function $F$ is not regular. The pumping lemma states that if $F$ is regular then there
exists a number $n_0$ such that for every large enough $w$ with $F(w) = 1$, there exists a partition of $w$ into $w = xyz$ satisfying certain conditions such
that for every $k \in \mathbb{N}$, $F(xy^kz) = 1$. You can imagine a pumping-lemma based proof as a game between you and the adversary. Every "there exists"
quantifier corresponds to an object you are free to choose on your own (and base your choice on previously chosen objects). Every "for every" quantifier
corresponds to an object the adversary can choose arbitrarily (and again based on prior choices) as long as it satisfies the conditions. A valid proof
corresponds to a strategy by which no matter what the adversary does, you can win the game by obtaining a contradiction which would be a choice
of $k$ that would result in $F(xy^kz) = 0$, hence violating the conclusion of the pumping lemma.


Solved Exercise 6.4 (Palindromes is not regular). Prove that the following
function over the alphabet $\{0, 1, ;\}$ is not regular: $PAL(w) = 1$ if and
only if $w = u;u^R$ where $u \in \{0,1\}^*$ and $u^R$ denotes $u$ "reversed":
the string $u_{|u|-1} \cdots u_0$. (The Palindrome function is most often defined
without an explicit separator character ;, but the version with such a
separator is a bit cleaner, and so we use it here. This does not make
much difference, as one can easily encode the separator as a special
binary string instead.)


Solution: 

We use the pumping lemma. Suppose, towards a contradiction,
that there is a regular expression $e$ computing $PAL$,
and let $n_0$ be the number obtained by the pumping lemma
(Theorem 6.21). Consider the string $w = 0^{n_0};0^{n_0}$. Since the reverse
of the all zero string is the all zero string, $PAL(w) = 1$. Now, by
the pumping lemma, if $PAL$ is computed by $e$, then we can write
$w = xyz$ such that $|xy| \leq n_0$, $|y| \geq 1$ and $PAL(xy^kz) = 1$ for
every $k \in \mathbb{N}$. In particular, it must hold that $PAL(xz) = 1$, but this
is a contradiction, since $xz = 0^{n_0-|y|};0^{n_0}$ and so its two parts are
not of the same length and in particular are not the reverse of one
another.


For yet another example of a pumping-lemma based proof, see
Fig. 6.10 which shows a cartoon of the proof of the non-regularity
of the function $F : \{0,1\}^* \to \{0,1\}$ which is defined as $F(x) = 1$ iff
$x = 0^n1^n$ for some $n \in \mathbb{N}$ (i.e., $x$ consists of a string of consecutive
zeroes, followed by a string of consecutive ones of the same length).


6.6 ANSWERING SEMANTIC QUESTIONS ABOUT REGULAR EXPRESSIONS


Regular expressions have applications beyond search. For example, 
regular expressions are often used to define tokens (such as what is a 
valid variable identifier, or keyword) in the design of parsers, compilers 
and interpreters for programming languages. Regular expressions 
have other applications too: for example, in recent years, the world
of networking moved from fixed topologies to "software defined
networks". Such networks are routed by programmable switches
that can implement policies such as "if packet is secured by SSL then
forward it to A, otherwise forward it to B". To represent such policies
we need a language that is on one hand sufficiently expressive to
capture the policies we want to implement, but on the other hand
sufficiently restrictive so that we can quickly execute them at network
speed and also be able to answer questions such as "can C see the
packets moved from A to B?". The NetKAT network programming
language uses a variant of regular expressions to achieve precisely
that. For this application, it is important that we are not merely able
to answer whether an expression $e$ matches a string $x$ but also answer
semantic questions about regular expressions such as "do expressions
$e$ and $e'$ compute the same function?" and "does there exist a string $x$
that is matched by the expression $e$?". The following theorem shows
that we can answer the latter question:


Theorem 6.23 (Emptiness of regular languages is computable). There is an
algorithm that given a regular expression $e$, outputs 1 if and only if
$\Phi_e$ is the constant zero function.


Proof Idea: 

The idea is that we can directly observe this from the structure 
of the expression. The only way a regular expression e computes 
the constant zero function is if e has the form Ø or is obtained by 
concatenating Ø with other expressions. 

* 


Proof of Theorem 6.23. Define a regular expression to be "empty" if it
computes the constant zero function. Given a regular expression $e$, we
can determine if $e$ is empty using the following rules:

• If $e$ has the form $\sigma$ or "" then it is not empty.

• If $e$ is not empty then $e|e'$ is not empty for every $e'$.

• If $e$ is not empty then $e^*$ is not empty.

• If $e$ and $e'$ are both not empty then $e\,e'$ is not empty.

• $\emptyset$ is empty.

Using these rules, it is straightforward to come up with a recursive
algorithm to determine emptiness.
■
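One possible shape for this recursive algorithm is sketched below. The representation is an assumption made for the example: an expression is a nested tuple such as `("cat", ("sym", "0"), ("empty",))`, with hypothetical tags `empty`, `eps`, `sym`, `or`, `cat` and `star`. The sketch also treats every starred expression as non-empty, since $e^*$ always matches the empty string (zero repetitions); this is slightly stronger than the rule stated above.

```python
def is_empty(e):
    """Return True iff the expression e matches no string at all."""
    kind = e[0]
    if kind == "empty":                  # the expression for the empty set
        return True
    if kind in ("sym", "eps"):           # a single symbol, or ""
        return False
    if kind == "star":                   # e* always matches ""
        return False
    if kind == "or":                     # e|e' is empty iff both sides are
        return is_empty(e[1]) and is_empty(e[2])
    if kind == "cat":                    # e e' is empty iff either side is
        return is_empty(e[1]) or is_empty(e[2])
    raise ValueError(f"unknown expression tag: {kind!r}")

zero = ("sym", "0")
nothing = ("empty",)
```

The recursion visits each sub-expression once, so the running time is linear in the size of the expression.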


Using Theorem 6.23, we can obtain an algorithm that determines 
whether or not two regular expressions e and e’ are equivalent, in the 
sense that they compute the same function. 


Theorem 6.24 (Equivalence of regular expressions is computable). Let
$REGEQ : \{0,1\}^* \to \{0,1\}$ be the function that on input (a string
representing) a pair of regular expressions $e, e'$, $REGEQ(e,e') = 1$
if and only if $\Phi_e = \Phi_{e'}$. Then there is an algorithm that computes
$REGEQ$.

Proof Idea:

The idea is to show that given a pair of regular expressions $e$ and
$e'$ we can find an expression $e''$ such that $\Phi_{e''}(x) = 1$ if and only if
$\Phi_e(x) \neq \Phi_{e'}(x)$. Therefore $\Phi_{e''}$ is the constant zero function if and only
if $e$ and $e'$ are equivalent, and thus we can test for emptiness of $e''$ to
determine equivalence of $e$ and $e'$.
*


Proof of Theorem 6.24. We will prove Theorem 6.24 from Theorem 6.23.
(The two theorems are in fact equivalent: it is easy to prove
Theorem 6.23 from Theorem 6.24, since checking for emptiness is the same
as checking equivalence with the expression $\emptyset$.) Given two regular
expressions $e$ and $e'$, we will compute an expression $e''$ such that
$\Phi_{e''}(x) = 1$ if and only if $\Phi_e(x) \neq \Phi_{e'}(x)$. One can see that $e$ is
equivalent to $e'$ if and only if $e''$ is empty.

We start with the observation that for every bit $a,b \in \{0,1\}$, $a \neq b$ if
and only if

$$(a \wedge \overline{b}) \vee (\overline{a} \wedge b).$$

Hence we need to construct $e''$ such that for every $x$,

$$\Phi_{e''}(x) = (\Phi_e(x) \wedge \overline{\Phi_{e'}(x)}) \vee (\overline{\Phi_e(x)} \wedge \Phi_{e'}(x)). \qquad (6.3)$$

To construct the expression $e''$, we will show how given any pair of
expressions $e$ and $e'$, we can construct expressions $e \wedge e'$ and $\overline{e}$ that
compute the functions $\Phi_e \wedge \Phi_{e'}$ and $\overline{\Phi_e}$ respectively. (Computing the
expression for $e \vee e'$ is straightforward using the $|$ operation of regular
expressions.)

Specifically, by Lemma 6.18, regular functions are closed under
negation, which means that for every regular expression $e$, there is an
expression $\overline{e}$ such that $\Phi_{\overline{e}}(x) = 1 - \Phi_e(x)$ for every $x \in \{0,1\}^*$. Now,
for every two expressions $e$ and $e'$, the expression

$$e \wedge e' = \overline{(\overline{e} \,|\, \overline{e'})}$$

computes the AND of the two expressions. Given these two transformations,
we see that for every pair of regular expressions $e$ and $e'$ we can find
a regular expression $e''$ satisfying (6.3) such that $e''$ is empty if and
only if $e$ and $e'$ are equivalent.
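In practice the same test is often run on automata rather than on expressions. The sketch below is not the construction of the proof: it swaps in the standard product construction, assuming the two expressions have already been converted to DFAs (given as dictionaries from `(state, symbol)` to state, with start state 0). It explores the reachable pairs of states and reports a difference as soon as exactly one of the two automata accepts, which plays the role of testing the "XOR" expression $e''$ for emptiness. The name `dfa_equivalent` is hypothetical.

```python
from collections import deque

def dfa_equivalent(T1, acc1, T2, acc2, alphabet="01"):
    """True iff the two DFAs give the same output on every input string."""
    start = (0, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        s1, s2 = queue.popleft()
        # Some string leads here; the outputs differ iff exactly one accepts.
        if (s1 in acc1) != (s2 in acc2):
            return False
        for a in alphabet:
            nxt = (T1[(s1, a)], T2[(s2, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

# Three small example automata (all hypothetical).
T_parity = {(s, a): s ^ int(a) for s in range(2) for a in "01"}         # odd number of ones
T_mod4 = {(s, a): (s + int(a)) % 4 for s in range(4) for a in "01"}     # ones counted mod 4
T_end = {(s, a): int(a) for s in range(2) for a in "01"}                # ends with 1
```

The search visits at most $C_1 \cdot C_2$ pairs of states, so it always terminates.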


Chapter Recap


• We model computational tasks on arbitrarily large
inputs using infinite functions $F : \{0,1\}^* \to \{0,1\}^*$.

• Such functions take an arbitrarily long (but still
finite!) string as input, and cannot be described by
a finite table of inputs and outputs.

• A function with a single bit of output is known as
a Boolean function, and the task of computing it is
equivalent to deciding a language $L \subseteq \{0,1\}^*$.

6.7 EXERCISES


Exercise 6.1 (Closure properties of regular functions). Suppose that $F,G :
\{0,1\}^* \to \{0,1\}$ are regular. For each one of the following definitions
of the function $H$, either prove that $H$ is always regular or give a
counterexample for regular $F, G$ that would make $H$ not regular.


1. $H(x) = F(x) \vee G(x)$.
2. $H(x) = F(x) \wedge G(x)$.
3. $H(x) = NAND(F(x), G(x))$.
4. $H(x) = F(x^R)$ where $x^R$ is the reverse of $x$: $x^R = x_{n-1}x_{n-2} \cdots x_0$ for
$n = |x|$.
5. $H(x) = 1$ if $x = uv$ such that $F(u) = G(v) = 1$, and $H(x) = 0$ otherwise.
6. $H(x) = 1$ if $x = uu$ such that $F(u) = G(u) = 1$, and $H(x) = 0$ otherwise.
7. $H(x) = 1$ if $x = uu^R$ such that $F(u) = G(u) = 1$, and $H(x) = 0$ otherwise.

Exercise 6.2 One among the following two functions that map $\{0,1\}^*$
to $\{0,1\}$ can be computed by a regular expression, and the other one
cannot. For the one that can be computed by a regular expression,
write the expression that does it. For the one that cannot, prove that
this cannot be done using the pumping lemma.


• $F(x) = 1$ if 4 divides $\sum_{i=0}^{|x|-1} x_i$, and $F(x) = 0$ otherwise.

• $G(x) = 1$ if and only if $\sum_{i=0}^{|x|-1} x_i \geq |x|/4$ and $G(x) = 0$ otherwise.


Exercise 6.3 (Non-regularity). 1. Prove that the following function $F :
\{0,1\}^* \to \{0,1\}$ is not regular. For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $x$ is
of the form $x = 1^{3^i}$ for some $i > 0$.

2. Prove that the following function $F : \{0,1\}^* \to \{0,1\}$ is not regular.
For every $x \in \{0,1\}^*$, $F(x) = 1$ iff $\sum_j x_j = 3^i$ for some $i > 0$.


6.8 BIBLIOGRAPHICAL NOTES 


The relation of regular expressions with finite automata is a beautiful
topic, which we only touch upon in this text. It is covered more
extensively in [Sip97; HMU14; Koz97]. These texts also discuss topics
such as non-deterministic finite automata (NFA) and the relation
between context-free grammars and pushdown automata.

The automaton of Fig. 6.4 was generated using the FSM simulator
of Ivan Zuzak and Vedrana Jankovic. Our proof of Theorem 6.12 is
closely related to the Myhill-Nerode Theorem. One direction of the
Myhill-Nerode theorem can be stated as saying that if $e$ is a regular
expression then there is at most a finite number of strings $z_0, \ldots, z_{k-1}$
such that $\Phi_{e[z_i]} \neq \Phi_{e[z_j]}$ for every $0 \leq i \neq j < k$.


7 


Loops and infinity 


"The bounds of arithmetic were however outstepped the moment the
idea of applying the [punched] cards had occurred; and the Analytical
Engine does not occupy common ground with mere 'calculating
machines.'... In enabling mechanism to combine together general symbols,
in successions of unlimited variety and extent, a uniting link is
established between the operations of matter and the abstract mental
processes of the most abstract branch of mathematical science.", Ada
Augusta, countess of Lovelace, 1843


As the quote of Chapter 6 says, an algorithm is "a finite answer to
an infinite number of questions". To express an algorithm, we need to
write down a finite set of instructions that will enable us to compute
on arbitrarily long inputs. To describe and execute an algorithm we
need the following components (see Fig. 7.1):

• The finite set of instructions to be performed.

• Some "local variables" or finite state used in the execution.

• A potentially unbounded working memory to store the input and
any other values we may require later.

• While the memory is unbounded, at every single step we can only
read and write to a finite part of it, and we need a way to address
which are the parts we want to read from and write to.

• If we only have a finite set of instructions but our input can be
arbitrarily long, we will need to repeat instructions (i.e., loop back).
We need a mechanism to decide when we will loop and when we
will halt.
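As a small illustration of these components, here is the XOR of an arbitrarily long bit list written as a Python loop, with each line annotated by the component it plays. The function name `xor_all` is chosen for this example only.

```python
def xor_all(X):
    """Compute the XOR (parity) of all bits in the list X."""
    res = 0                    # finite state / local variable
    i = 0                      # addressing mechanism: index into the memory X
    while i < len(X):          # looping: repeat the same finite instructions
        res = res ^ X[i]       # finite logic on one cell of unbounded memory
        i += 1
    return res                 # halting: stop once the whole input is read
```

The program text has a fixed length, yet it handles inputs of every length, which is exactly what a circuit or straight-line program cannot do.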


Compiled on 12.19.2022 22:58 


Learning Objectives:

• Learn the model of Turing machines, which
can compute functions of arbitrary input
lengths.

• See a programming-language description of
Turing machines, using NAND-TM programs,
which add loops and arrays to NAND-CIRC.

• See some basic syntactic sugar and
equivalence of variants of Turing machines
and NAND-TM programs.


Figure 7.1 (panel contents: a short program on input $X \in \{0,1\}^*$ that repeatedly
updates res with XOR(res, X[i]) and returns res, annotated with the labels
"Unbounded memory / arrays", "Finite state / local variables",
"Addressing mechanism / indexing", "Finite logic", and "Looping, halting").

Figure 7.1: An algorithm is a finite recipe to compute
on arbitrarily long inputs. The components of an
algorithm include the instructions to be performed,
finite state or "local variables", the memory to store
the input and intermediate computations, as well as
mechanisms to decide which part of the memory to
access, and when to repeat instructions and when to
halt.

Figure 7.2 (panels: "Specification" and "Implementation").

Figure 7.2: Overview of our models for finite and
unbounded computation. In the previous chapters
we studied the computation of finite functions, which
are functions $f : \{0,1\}^n \to \{0,1\}^m$ for some fixed
$n, m$, and modeled computing these functions using
circuits or straight-line programs. In this chapter we
study computing unbounded functions of the form
$F : \{0,1\}^* \to \{0,1\}$ or $F : \{0,1\}^* \to \{0,1\}^*$.
We model computing these functions using Turing
Machines or (equivalently) NAND-TM programs,
which add the notion of loops to the NAND-CIRC
programming language. In Chapter 8 we will show
that these models are equivalent to many other
models, including RAM machines, the $\lambda$ calculus, and
all the common programming languages including C,
Python, Java, JavaScript, etc.


7.1 TURING MACHINES


"Computing is normally done by writing certain symbols on paper. We may
suppose that this paper is divided into squares like a child's arithmetic book..
The behavior of the [human] computer at any moment is determined by the
symbols which he is observing, and of his 'state of mind' at that moment... We
may suppose that in a simple operation not more than one symbol is altered.",
"We compare a man in the process of computing ... to a machine which is only
capable of a finite number of configurations... The machine is supplied with a
'tape' (the analogue of paper) ... divided into sections (called 'squares') each
capable of bearing a 'symbol'", Alan Turing, 1936


"What is the difference between a Turing machine and the modern
computer? It's the same as that between Hillary's ascent of Everest and
the establishment of a Hilton hotel on its peak.", Alan Perlis, 1982.


The “granddaddy” of all models of computation is the Turing ma- 
chine. Turing machines were defined in 1936 by Alan Turing in an 
attempt to formally capture all the functions that can be computed 
by human “computers” (see Fig. 7.4) that follow a well-defined set of 
rules, such as the standard algorithms for addition or multiplication. 

Turing thought of such a person as having access to as much 
“scratch paper” as they need. For simplicity, we can think of this 
scratch paper as a one dimensional piece of graph paper (or tape, as 
it is commonly referred to). The paper is divided into “cells”, where 
each “cell” can hold a single symbol (e.g., one digit or letter, and more 
generally, some element of a finite alphabet). At any point in time, the 
person can read from and write to a single cell of the paper. Based on 
the contents of this cell, the person can update their finite mental state, 
and/or move to the cell immediately to the left or right of the current 
one. 

Turing modeled such a computation by a "machine" that maintains
one of $k$ states. At each point in time the machine reads from its "work
tape" a single symbol from a finite alphabet $\Sigma$ and uses that to update
its state, write to tape, and possibly move to an adjacent cell (see
Fig. 7.7). To compute a function $F$ using this machine, we initialize the
tape with the input $x \in \{0,1\}^*$ and our goal is to ensure that the tape
will contain the value $F(x)$ at the end of the computation. Specifically,
a computation of a Turing machine $M$ with $k$ states and alphabet $\Sigma$ on
input $x \in \{0,1\}^*$ proceeds as follows:


• Initially the machine is at state 0 (known as the "starting state")
and the tape is initialized to $\triangleright, x_0, \ldots, x_{n-1}, \varnothing, \varnothing, \ldots$. We use the
symbol $\triangleright$ to denote the beginning of the tape, and the symbol $\varnothing$ to
denote an empty cell. We will always assume that the alphabet $\Sigma$ is
a (potentially strict) superset of $\{\triangleright, \varnothing, 0, 1\}$.

• The location $i$ to which the machine points is set to 0.

• At each step, the machine reads the symbol $\sigma = T[i]$ that is in the
$i$-th location of the tape. Based on this symbol and its state $s$, the
machine decides on:

  - What symbol $\sigma'$ to write on the tape.
  - Whether to move Left (i.e., $i \leftarrow i - 1$), Right (i.e., $i \leftarrow i + 1$), Stay
    in place, or Halt the computation.
  - What is going to be the new state $s \in [k]$.

• The set of rules the Turing machine follows is known as its
transition function.

• When the machine halts, its output is the binary string obtained by
reading the tape from the beginning until the first location in which
it contains a $\varnothing$ symbol, and then outputting all 0 and 1 symbols in
sequence, dropping the initial $\triangleright$ symbol if it exists, as well as the
final $\varnothing$ symbol.


Figure 7.3: Aside from his many other achievements,
Alan Turing was an excellent long-distance runner
who just fell shy of making England's Olympic team.
A fellow runner once asked him why he punished
himself so much in training. Alan said "I have such
a stressful job that the only way I can get it out of my
mind is by running hard; it's the only way I can get
some release."


Figure 7.4: Until the advent of electronic computers,
the word "computer" was used to describe a person
that performed calculations. Most of these "human
computers" were women, and they were absolutely
essential to many achievements, including mapping
the stars, breaking the Enigma cipher, and the NASA
space mission; see also the bibliographical notes.
Photo from National Photo Company Collection; see
also [Sob17].


Figure 7.5: Steam-powered Turing machine mural,
painted by CSE grad students at the University of
Washington on the night before spring qualifying
examinations, 1987. Image from
https://www.cs.washington.edu/building/art/SPTM.


7.1.1 Extended example: A Turing machine for palindromes
Let $PAL$ (for palindromes) be the function that on input $x \in \{0,1\}^*$,
outputs 1 if and only if $x$ is an (even length) palindrome, in the sense
that $x = w_0 \cdots w_{n-1}w_{n-1}w_{n-2} \cdots w_0$ for some $n \in \mathbb{N}$ and $w \in \{0,1\}^n$.
We now show a Turing machine $M$ that computes $PAL$. To specify
$M$ we need to specify (i) $M$'s tape alphabet $\Sigma$ which should contain at
least the symbols 0, 1, $\triangleright$ and $\varnothing$, and (ii) $M$'s transition function which
determines what action $M$ takes when it reads a given symbol while it
is in a particular state.
In our case, $M$ will use the alphabet $\{0, 1, \triangleright, \varnothing, \times\}$ and will have
$k = 13$ states. Though the states are simply numbers between 0 and
$k - 1$, we will give them the following labels for convenience:


State   Label
0       START
1       RIGHT_0
2       RIGHT_1
3       LOOK_FOR_0
4       LOOK_FOR_1
5       RETURN
6       REJECT
7       ACCEPT
8       OUTPUT_0
9       OUTPUT_1
10      0_AND_BLANK
11      1_AND_BLANK
12      BLANK_AND_STOP

We describe the operation of our Turing machine M in words:


• $M$ starts in state START and goes right, looking for the first symbol
that is 0 or 1. If it finds $\varnothing$ before it hits such a symbol then it moves
to the OUTPUT_1 state described below.

• Once $M$ finds such a symbol $b \in \{0,1\}$, $M$ deletes $b$ from the tape
by writing the $\times$ symbol, it enters either the RIGHT_0 or RIGHT_1
mode according to the value of $b$ and starts moving rightwards
until it hits the first $\varnothing$ or $\times$ symbol.

• Once $M$ finds this symbol, it goes into the state LOOK_FOR_0 or
LOOK_FOR_1 depending on whether it was in the state RIGHT_0 or
RIGHT_1 and makes one left move.

• In the state LOOK_FOR_b, $M$ checks whether the value on the tape is
$b$. If it is, then $M$ deletes it by changing its value to $\times$, and moves to
the state RETURN. Otherwise, it changes to the OUTPUT_0 state.

• The RETURN state means that $M$ goes back to the beginning.
Specifically, $M$ moves leftward until it hits the first symbol that is not 0 or
1, in which case it changes its state to START.

• The OUTPUT_b states mean that $M$ will eventually output the value
$b$. In both the OUTPUT_0 and OUTPUT_1 states, $M$ goes left until it
hits $\triangleright$. Once it does so, it makes a right step, and changes to the
0_AND_BLANK or 1_AND_BLANK states respectively. In the latter states,
$M$ writes the corresponding value, moves right and changes to the
BLANK_AND_STOP state, in which it writes $\varnothing$ to the tape and halts.


Figure 7.6 (legend: Unbounded tape; Transition function; Finite state;
Head position; Halt/move decision).

Figure 7.6: The components of a Turing Machine. Note
how they correspond to the general components of
algorithms as described in Fig. 7.1.


The above description can be turned into a table describing for each
one of the $13 \cdot 5$ combinations of state and symbol, what the Turing
machine will do when it is in that state and it reads that symbol. This
table is known as the transition function of the Turing machine.
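The verbal description can be turned into code fairly directly. The following is a sketch rather than the book's own implementation: it uses the ASCII stand-ins `">"`, `"_"` and `"x"` for the begin-of-tape, blank and "deleted" symbols, writes the transition function as a Python function `delta` instead of an explicit table, and fills in details the text leaves open (for instance, which way to move when switching from RETURN to START). The states REJECT and ACCEPT appear in the table of labels but are not used by the description, so here they simply halt.

```python
(START, RIGHT_0, RIGHT_1, LOOK_FOR_0, LOOK_FOR_1, RETURN, REJECT, ACCEPT,
 OUTPUT_0, OUTPUT_1, ZERO_AND_BLANK, ONE_AND_BLANK, BLANK_AND_STOP) = range(13)

def delta(s, c):
    """Transition function: (state, symbol) -> (new state, written symbol, move)."""
    if s == START:
        if c in "01":                       # found b: delete it, remember it
            return (RIGHT_0 if c == "0" else RIGHT_1, "x", "R")
        if c == "_":                        # everything matched
            return (OUTPUT_1, c, "S")
        return (START, c, "R")              # skip '>' and 'x'
    if s in (RIGHT_0, RIGHT_1):
        if c in "01":
            return (s, c, "R")
        return (LOOK_FOR_0 if s == RIGHT_0 else LOOK_FOR_1, c, "L")
    if s in (LOOK_FOR_0, LOOK_FOR_1):
        b = "0" if s == LOOK_FOR_0 else "1"
        if c == b:
            return (RETURN, "x", "L")
        return (OUTPUT_0, c, "S")
    if s == RETURN:
        if c in "01":
            return (RETURN, c, "L")
        return (START, c, "R")
    if s in (OUTPUT_0, OUTPUT_1):
        if c == ">":
            return (ZERO_AND_BLANK if s == OUTPUT_0 else ONE_AND_BLANK, c, "R")
        return (s, c, "L")
    if s == ZERO_AND_BLANK:
        return (BLANK_AND_STOP, "0", "R")
    if s == ONE_AND_BLANK:
        return (BLANK_AND_STOP, "1", "R")
    if s == BLANK_AND_STOP:
        return (s, "_", "H")
    return (s, c, "H")                      # REJECT / ACCEPT: unused here

def run(delta, x, max_steps=10_000):
    """Run the machine on x; return its output, or None if the step budget runs out."""
    tape = [">"] + list(x) + ["_"]
    i, s = 0, START
    for _ in range(max_steps):
        s, tape[i], d = delta(s, tape[i])
        if d == "H":
            break
        if d == "R":
            i += 1
        elif d == "L":
            i = max(i - 1, 0)
        if i == len(tape):                  # extend the tape with blanks on demand
            tape.append("_")
    else:
        return None
    out = []
    for c in tape:                          # read up to the first blank
        if c == "_":
            break
        if c in "01":
            out.append(c)
    return "".join(out)
```

For example, `run(delta, "0110")` follows the steps above: it crosses out the outer pair of zeros, then the inner pair of ones, and finally writes a single 1 next to the start marker.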


7.1.2 Turing machines: a formal definition 


Figure 7.7: A Turing machine has access to a tape of 
unbounded length. At each point in the execution, 
the machine can read a single symbol of the tape, 
and based on that and its current state, write a new 
symbol, update the tape, and decide whether to move left, 
right, stay, or halt. (The figure shows the tape ▷, 0, 1, …, ∅, ∅, …, 
the head position i ∈ ℕ, the local state s ∈ [k], and a 
transition function M : [k] × Σ → [k] × Σ × {L, R, S, H} with sample 
rules such as "If read 0 and state is 17 then change state to 29, 
write 1, and move left.") 

The formal definition of Turing machines is as follows: 
Definition 7.1 — Turing Machine. A (one tape) Turing machine with k 
states and alphabet Σ ⊇ {0, 1, ▷, ∅} is represented by a transition 
function δ_M : [k] × Σ → [k] × Σ × {L, R, S, H}. 

For every x ∈ {0,1}*, the output of M on input x, denoted by 
M(x), is the result of the following process: 

• We initialize T to be the sequence ▷, x_0, x_1, …, x_{n−1}, ∅, ∅, …, 
where n = |x|. (That is, T[0] = ▷, T[i+1] = x_i for i ∈ [n], and 
T[i] = ∅ for i > n.) 

• We also initialize i = 0 and s = 0. 

• We then repeat the following process: 

  1. Let (s′, σ′, D) = δ_M(s, T[i]). 
  2. Set s → s′, T[i] → σ′. 
  3. If D = R then set i → i + 1, if D = L then set i → max{i − 1, 0}. 
     (If D = S then we keep i the same.) 
  4. If D = H, then halt. 

• If the process above halts, then M's output, denoted by M(x), is 
the string y ∈ {0,1}* obtained by concatenating all the symbols 
in {0, 1} in positions T[0], …, T[i] where i + 1 is the first location 
in the tape containing ∅. 

• If the Turing machine does not halt then we denote M(x) = ⊥. 
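The process in Definition 7.1 can be sketched as a short Python loop. This is only an illustration under stated assumptions: the names run_tm, FLIP, LOOP and the max_steps budget are ours and not part of the definition, the transition function is passed as a dictionary, and a run that exhausts the budget returns None as a finite stand-in for M(x) = ⊥.

```python
BLANK, START = "∅", "▷"  # stand-ins for the blank and start-of-tape symbols


def run_tm(delta, x, max_steps=10_000):
    """Follow the process of Definition 7.1 for at most max_steps steps.

    delta: dict mapping (state, symbol) -> (new_state, new_symbol, direction),
           where direction is one of "L", "R", "S", "H".
    Returns the output string, or None if the step budget runs out.
    """
    tape = [START] + list(x)   # cells beyond the list are implicitly blank
    i, s = 0, 0                # head position and state both start at 0
    for _ in range(max_steps):
        if i == len(tape):     # grow the tape lazily with blanks
            tape.append(BLANK)
        s, tape[i], d = delta[(s, tape[i])]
        if d == "R":
            i += 1
        elif d == "L":
            i = max(i - 1, 0)
        elif d == "H":
            out = []
            for sym in tape:   # collect 0/1 symbols up to the first blank
                if sym == BLANK:
                    break
                if sym in ("0", "1"):
                    out.append(sym)
            return "".join(out)
    return None


# Hypothetical two-state machine that flips every input bit.
FLIP = {
    (0, START): (1, START, "R"),
    (1, "0"): (1, "1", "R"),
    (1, "1"): (1, "0", "R"),
    (1, BLANK): (1, BLANK, "H"),
}

# Hypothetical machine that never outputs H, so it never halts.
LOOP = {(0, START): (0, START, "S")}
```

Calling run_tm(FLIP, "0110") walks right over the input, flipping each bit, and halts on the first blank; run_tm(LOOP, "01", max_steps=50) gives up after 50 steps.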


One should not confuse the transition function δ_M of a Turing 
machine M with the function that the machine computes. The transition 
function δ_M is a finite function, with k|Σ| inputs and 4k|Σ| outputs. 
(Can you see why?) The machine can compute an infinite function F 
that takes as input a string x ∈ {0,1}* of arbitrary length and might 
also produce an arbitrary length string as output. 
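One way to see the counts k|Σ| and 4k|Σ| is to enumerate the possible inputs and outputs of a transition function directly. The sketch below does this for an assumed example with k = 13 states and a five-symbol alphabet; the function names and the symbol list SIGMA are illustrative choices of ours.

```python
from itertools import product


def transition_inputs(k, alphabet):
    """All (state, symbol) pairs on which a transition function is defined."""
    return list(product(range(k), alphabet))


def transition_outputs(k, alphabet):
    """All possible (new state, new symbol, direction) results."""
    return list(product(range(k), alphabet, "LRSH"))


# Assumed example: 13 states and the five symbols 0, 1, ▷, ∅, x.
SIGMA = ["0", "1", "▷", "∅", "x"]
NUM_INPUTS = len(transition_inputs(13, SIGMA))    # k * |Σ|
NUM_OUTPUTS = len(transition_outputs(13, SIGMA))  # 4 * k * |Σ|
```

Both lists are finite no matter how long the machine's inputs are, which is the sense in which the transition function is a finite object.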

In our formal definition, we identified the machine M with its 
transition function δ_M since the transition function tells us everything 
we need to know about the Turing machine. However, this choice of 
representation is somewhat arbitrary, and is based on our convention 
that the state space is always the numbers {0, …, k − 1} with 0 as 
the starting state. Other texts use different conventions, and so their 
mathematical definition of a Turing machine might look superficially 
different. However, these definitions describe the same computational 
process and have the same computational powers. Hence they are 
equivalent despite their superficial differences. See Section 7.7 for a 
comparison between Definition 7.1 and the way Turing Machines are 
defined in texts such as Sipser [Sip97]. 


7.1.3 Computable functions 
We now turn to make one of the most important definitions in this 
book: computable functions. 


Definition 7.2 — Computable functions. Let F : {0,1}* → {0,1}* be 
a (total) function and let M be a Turing machine. We say that M 
computes F if for every x ∈ {0,1}*, M(x) = F(x). 

We say that a function F is computable if there exists a Turing 
machine M that computes it. 


Defining a function “computable” if and only if it can be computed 
by a Turing machine might seem “reckless” but, as we'll see in Chap- 
ter 8, being computable in the sense of Definition 7.2 is equivalent to 
being computable in virtually any reasonable model of computation. 
This statement is known as the Church-Turing Thesis. (Unlike the ex- 
tended Church-Turing Thesis which we discussed in Section 5.6, the 
Church-Turing thesis itself is widely believed and there are no candi- 
date devices that attack it.) 


This is a good point to remind the reader that functions are not the 
same as programs: 


Functions ≠ Programs. 


A Turing machine (or program) M can compute some function 
F, but it is not the same as F. In particular, there can be more than 
one program to compute the same function. Being computable is a 
property of functions, not of machines. 

We will often pay special attention to functions F : {0,1}* → {0,1} 
that have a single bit of output. Hence we give a special name for the 
set of computable functions of this form. 
Definition 7.3 — The class R. We define R to be the set of all computable 
functions F : {0,1}* → {0,1}. 


7.1.4 Infinite loops and partial functions 


One crucial difference between circuits/straight-line programs and 
Turing machines is the following. Looking at a NAND-CIRC program 
P, we can always tell how many inputs and how many outputs P 
has by simply looking at the X and Y variables. Furthermore, we are 
guaranteed that if we invoke P on any input, then some output will be 
produced. 

In contrast, given a Turing machine M, we cannot determine a 
priori the length of M’s output. In fact, we don’t even know if an 
output would be produced at all! For example, it is straightforward 
to come up with a Turing machine whose transition function never 
outputs H and hence never halts. 

If a machine M fails to stop and produce an output on some input 
x, then it cannot compute any total function F, since clearly on input 
x, M will fail to output F(x). However, M can still compute a partial 
function.¹ 

For example, consider the partial function DIV that on input a pair 
(a, b) of natural numbers, outputs ⌈a/b⌉ if b > 0, and is undefined 
otherwise. We can define a Turing machine M that computes DIV on 
input a, b by outputting the first c = 0, 1, 2, … such that cb ≥ a. If a > 0 
and b = 0 then the machine M will never halt, but this is OK, since 
DIV is undefined on such inputs. If a = 0 and b = 0, the machine M 
will output 0, which is also OK, since we don't care about what the 
program outputs on inputs on which DIV is undefined. Formally, we 
define computability of partial functions as follows: 

¹ A partial function F from a set A to a set B is a 
function that is only defined on a subset of A (see 
Section 1.4.3). We can also think of such a function as 
mapping A to B ∪ {⊥} where ⊥ is a special "failure" 
symbol such that F(a) = ⊥ indicates the function F 
is not defined on a. 


Definition 7.5 — Computable (partial or total) functions. Let F be either a 
total or partial function mapping {0,1}* to {0,1}* and let M be a 
Turing machine. We say that M computes F if for every x ∈ {0,1}* 
on which F is defined, M(x) = F(x). We say that a (partial or 
total) function F is computable if there is a Turing machine that 
computes it. 


Note that if F is a total function, then it is defined on every x ∈ 
{0,1}* and hence in this case, Definition 7.5 is identical to Definition 7.2. 
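The DIV example can be sketched in Python as the same search over c = 0, 1, 2, …. This is an illustration rather than a Turing machine: the name div_search and the optional max_steps budget are our own additions, and the budget exists only so that the non-halting case (a > 0, b = 0) can be observed in finite time as a None result.

```python
def div_search(a, b, max_steps=None):
    """Return the first c = 0, 1, 2, ... with c * b >= a.

    With max_steps=None this loops forever when a > 0 and b == 0,
    mirroring a machine that never halts on inputs where DIV is undefined.
    With a budget, None is returned once the budget is exhausted.
    """
    c = 0
    steps = 0
    while c * b < a:
        c += 1
        steps += 1
        if max_steps is not None and steps >= max_steps:
            return None
    return c
```

On inputs with b > 0 the search stops at ⌈a/b⌉. On input (0, 0) it stops immediately with 0, which is harmless since DIV is undefined there and Definition 7.5 places no requirement on such inputs.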


7.2 TURING MACHINES AS PROGRAMMING LANGUAGES 


The name “Turing machine”, with its “tape” and “head” evokes a 
physical object, while in contrast we think of a program as a piece 
of text. But we can think of a Turing machine as a program as well. 
For example, consider the Turing machine M of Section 7.1.1 that 


computes the function PAL such that PAL(x) = 1 iff x is a palindrome. 


We can also describe this machine as a program using Python-like 
pseudocode of the form below: 

# Gets an array Tape initialized to 
# [">", x_0, x_1, ..., x_(n-1), "∅", "∅", ...] 
# At the end of the execution, Tape[1] is equal to 1 
# if x is a palindrome and is equal to 0 otherwise 
def PAL(Tape): 
    head = 0 
    state = 0  # START 
    while (state != 12): 
        if (state == 0 and Tape[head] == '0'): 
            state = 3  # LOOK_FOR_0 
            Tape[head] = 'x' 
            head += 1  # move right 
        if (state == 0 and Tape[head] == '1'): 
            state = 4  # LOOK_FOR_1 
            Tape[head] = 'x' 
            head += 1  # move right 
        # more if statements here 


The precise details of this program are not important. What mat- 
ters is that we can describe Turing machines as programs. Moreover, 
note that when translating a Turing machine into a program, the tape 
becomes a list or array that can hold values from the finite set Σ.² The 
head position can be thought of as an integer-valued variable that holds 
integers of unbounded size. The state is a local register that can hold 
one of a fixed number of values in [k]. 

More generally we can think of every Turing machine M as equiva- 
lent to a program similar to the following: 


# Gets an array Tape initialized to 
# [">", x_0, x_1, ..., x_(n-1), "∅", "∅", ...] 
def M(Tape): 
    state = 0 
    i = 0  # holds head location 
    while (True): 
        # Move head, modify state, write to tape 
        # based on current state and cell at head 
        # below are just examples for how program looks 
        # for a particular transition function 
        if Tape[i] == "0" and state == 7:  # δ_M(7,"0")=(19,"1","R") 
            Tape[i] = "1" 
            i += 1 
            state = 19 
        elif Tape[i] == ">" and state == 13:  # δ_M(13,">")=(15,"0","S") 
            Tape[i] = "0" 
            state = 15 
        elif ... 
        ... 
        elif Tape[i] == ">" and state == 29:  # δ_M(29,">")=(.,.,"H") 
            break  # Halt 

² Most programming languages use arrays of fixed 
size, while a Turing machine's tape is unbounded. But 
of course there is no need to store an infinite number 
of ∅ symbols. If you want, you can think of the tape 
as a list that starts off just long enough to store the 
input, but is dynamically grown in size as the Turing 
machine's head explores new positions. 


If we wanted to use only Boolean (i.e., 0/1-valued) variables, then 
we can encode the state variables using ⌈log k⌉ bits. Similarly, we 
can represent each element of the alphabet Σ using ℓ = ⌈log |Σ|⌉ bits 
and hence we can replace the Σ-valued array Tape[] with ℓ Boolean-valued 
arrays Tape0[], …, Tape(ℓ − 1)[]. 
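The Boolean encoding above can be sketched as follows. The helper names (bits_needed, to_bits, split_tape) and the particular numbering of the alphabet symbols are assumptions made for illustration; any fixed numbering of Σ would work equally well.

```python
def bits_needed(m):
    """Number of bits needed to distinguish m values, i.e. ceil(log2(m))."""
    return (m - 1).bit_length()


def to_bits(value, width):
    """Binary representation of value, least-significant bit first."""
    return [(value >> j) & 1 for j in range(width)]


def split_tape(tape, alphabet):
    """Replace one Σ-valued tape by ℓ Boolean-valued arrays.

    planes[j][t] is bit j of the code of the symbol at tape position t.
    """
    ell = bits_needed(len(alphabet))
    code = {sym: n for n, sym in enumerate(alphabet)}
    planes = [[] for _ in range(ell)]
    for sym in tape:
        for j, bit in enumerate(to_bits(code[sym], ell)):
            planes[j].append(bit)
    return planes
```

For a machine with 13 states, bits_needed(13) gives 4 state bits, and a five-symbol alphabet needs 3 Boolean arrays in place of the single tape.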


7.2.1 The NAND-TM Programming language 

We now introduce the NAND-TM programming language, which cap- 
tures the power of a Turing machine with a programming-language 
formalism. Like the difference between Boolean circuits and Turing 
machines, the main difference between NAND-TM and NAND-CIRC 
is that NAND-TM models a single uniform algorithm that can compute 
a function that takes inputs of arbitrary lengths. To do so, we extend the 
NAND-CIRC programming language with two constructs: 


• Loops: NAND-CIRC is a straight-line programming language: a 
NAND-CIRC program of s lines takes exactly s steps of computation 
and hence in particular, cannot even touch more than 3s variables. 
Loops allow us to use a fixed-length program to encode the 
instructions for a computation that can take an arbitrary amount of 
time. 

• Arrays: A NAND-CIRC program of s lines touches at most 3s variables. 
While we can use variables with names such as Foo_17 or 
Bar[22] in NAND-CIRC, they are not true arrays, since the number 
in the identifier is a constant that is "hardwired" into the program. 
NAND-TM contains actual arrays that can have a length that is not 
a priori bounded. 


Thus a good way to remember NAND-TM is using the following 
informal equation: 


NAND-TM = NAND-CIRC + loops + arrays (7.1) 
Concretely, the NAND-TM programming language adds the following 
features on top of NAND-CIRC (see Fig. 7.8): 

• We add a special integer valued variable i. All other variables in 
NAND-TM are Boolean valued (as in NAND-CIRC). 

• Apart from i, NAND-TM has two kinds of variables: scalars and 
arrays. Scalar variables hold one bit (just as in NAND-CIRC). Array 
variables hold an unbounded number of bits. At any point in the 
computation we can access the array variables at the location 
indexed by i using Foo[i]. We cannot access the arrays at locations 
other than the one pointed to by i. 

• We use the convention that arrays always start with a capital letter, 
and scalar variables (which are never indexed with i) start with 
lowercase letters. Hence Foo is an array and bar is a scalar variable. 

• The input and output X and Y are now considered arrays with 
values of zeroes and ones. (There are also two other special arrays 
X_nonblank and Y_nonblank, see below.) 

• We add a special MODANDJUMP instruction that takes two Boolean 
variables a, b as input and does the following: 

  - If a = 1 and b = 1 then MODANDJUMP(a, b) increments i by one 
    and jumps to the first line of the program. 

  - If a = 0 and b = 1 then MODANDJUMP(a, b) decrements i by one 
    and jumps to the first line of the program. (If i is already equal 
    to 0 then it stays at 0.) 

  - If a = 1 and b = 0 then MODANDJUMP(a, b) jumps to the first line of 
    the program without modifying i. 

  - If a = b = 0 then MODANDJUMP(a, b) halts execution of the 
    program. 

• The MODANDJUMP instruction always appears in the last line of a 
NAND-TM program and nowhere else. 

Figure 7.8: A NAND-TM program has scalar variables 
that can take a Boolean value, array variables that 
hold a sequence of Boolean values, and a special 
index variable i that can be used to index the array 
variables. We refer to the i-th value of the array 
variable Spam using Spam[i]. At each iteration of 
the program the index variable can be incremented 
or decremented by one step using the MODANDJUMP 
operation. 


Default values. We need one more convention to handle "default values". 
Turing machines have the special symbol ∅ to indicate that a tape 
location is "blank" or "uninitialized". In NAND-TM there is no such 
symbol, and all variables are Boolean, containing either 0 or 1. All 
variables and locations of arrays default to 0 if they have not been 
initialized to another value. To keep track of whether a 0 in an array 
corresponds to a true zero or to an uninitialized cell, a programmer can 
always add to an array Foo a "companion array" Foo_nonblank and 
set Foo_nonblank[i] to 1 whenever the i-th location is initialized. In 
particular, we will use this convention for the input and output arrays 
X and Y. A NAND-TM program has four special arrays X, X_nonblank, 
Y, and Y_nonblank. When a NAND-TM program is executed on input 
x ∈ {0,1}* of length n, the first n cells of the array X are initialized to 
x_0, …, x_{n−1} and the first n cells of the array X_nonblank are initialized 
to 1. (All uninitialized cells default to 0.) The output of a NAND-TM 
program is the string Y[0], …, Y[m − 1] where m is the smallest integer 
such that Y_nonblank[m] = 0. A NAND-TM program gets called 
with X and X_nonblank initialized to contain the input, and writes to Y 
and Y_nonblank to produce the output. 

Formally, NAND-TM programs are defined as follows: 

Definition 7.8 — NAND-TM programs. A NAND-TM program consists of 
a sequence of lines of the form foo = NAND(bar,blah) ending 
with a line of the form MODANDJUMP(foo,bar), where foo, bar, blah 
are either scalar variables (sequences of letters, digits, and 
underscores) or array variables of the form Foo[i] (starting with capital 
letters and indexed by i). The program has the array variables X, 
X_nonblank, Y, Y_nonblank and the index variable i built in, and 
can use additional array and scalar variables. 

If P is a NAND-TM program and x ∈ {0,1}* is an input then an 
execution of P on x is the following process: 

1. The arrays X and X_nonblank are initialized by X[i] = x_i and 
   X_nonblank[i] = 1 for all i ∈ [|x|]. All other variables and cells 
   are initialized to 0. The index variable i is also initialized to 0. 

2. The program is executed line by line. When the last line 
   MODANDJUMP(foo,bar) is executed we do as follows: 

   a. If foo = 1 and bar = 0, jump to the first line without modifying 
      the value of i. 

   b. If foo = 1 and bar = 1, increment i by one and jump to the 
      first line. 

   c. If foo = 0 and bar = 1, decrement i by one (unless it is already 
      zero) and jump to the first line. 

   d. If foo = 0 and bar = 0, halt and output Y[0], …, Y[m − 1] 
      where m is the smallest integer such that Y_nonblank[m] = 0. 
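The execution process of Definition 7.8 is simple enough to sketch as a small interpreter. The code below is an illustrative sketch under our own assumptions, not an official implementation: the names run_nandtm, NOT_PROGRAM and max_iters are ours, it accepts only "sugar free" lines exactly of the two forms in the definition, and it returns None if the iteration budget runs out.

```python
import re
from collections import defaultdict

LINE = re.compile(r"\s*(\S+)\s*=\s*NAND\(\s*(\S+?)\s*,\s*(\S+?)\s*\)\s*$")
JUMP = re.compile(r"\s*MODANDJUMP\(\s*(\S+?)\s*,\s*(\S+?)\s*\)\s*$")


def run_nandtm(program, x, max_iters=10_000):
    """Execute a sugar-free NAND-TM program (a string) on the bit string x."""
    lines = [ln for ln in program.strip().splitlines() if ln.strip()]
    body = [LINE.match(ln).groups() for ln in lines[:-1]]
    jump_a, jump_b = JUMP.match(lines[-1]).groups()

    scalars = defaultdict(int)                      # everything defaults to 0
    arrays = defaultdict(lambda: defaultdict(int))
    for j, bit in enumerate(x):
        arrays["X"][j] = int(bit)
        arrays["X_nonblank"][j] = 1
    i = 0

    def get(name):
        return arrays[name[:-3]][i] if name.endswith("[i]") else scalars[name]

    def put(name, value):
        if name.endswith("[i]"):
            arrays[name[:-3]][i] = value
        else:
            scalars[name] = value

    for _ in range(max_iters):
        for target, u, v in body:
            put(target, 1 - get(u) * get(v))        # NAND of two bits
        a, b = get(jump_a), get(jump_b)
        if a == 0 and b == 0:                       # halt and read the output
            out, m = [], 0
            while arrays["Y_nonblank"][m] == 1:
                out.append(str(arrays["Y"][m]))
                m += 1
            return "".join(out)
        if b == 1:                                  # 11: increment, 01: decrement
            i = i + 1 if a == 1 else max(i - 1, 0)
    return None


# Hypothetical example program: output the bitwise negation of the input.
NOT_PROGRAM = """
Y[i] = NAND(X[i],X[i])
temp = NAND(X_nonblank[i],X_nonblank[i])
Y_nonblank[i] = NAND(temp,temp)
MODANDJUMP(X_nonblank[i],X_nonblank[i])
"""
```

For instance, run_nandtm(NOT_PROGRAM, "011") scans the input left to right and halts at the first cell where X_nonblank is 0.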


7.2.2 Sneak peek: NAND-TM vs Turing machines 
As the name implies, NAND-TM programs are a direct implementation 
of Turing machines in programming language form. We will 
show the equivalence below, but you can already see how the components 
of Turing machines and NAND-TM programs correspond to one 
another: 


Table 7.2: Turing Machine and NAND-TM analogs 

Turing Machine | NAND-TM program 
State: single register that takes values in [k]. | Scalar variables: several variables such as foo, bar, etc., each taking values in {0,1}. 
Tape: one tape containing values in a finite set Σ. Potentially infinite, but T[t] defaults to ∅ for all locations t that have not been accessed. | Arrays: several arrays such as Foo, Bar, etc. For each such array Arr and index j, the value of Arr at position j is either 0 or 1. The value defaults to 0 for positions that have not been written to. 
Head location: a number i ∈ ℕ that encodes the position of the head. | Index variable: the variable i that can be used to access the arrays. 
Accessing memory: at every step the Turing machine has access to its local state, but can only access the tape at the position of the current head location. | Accessing memory: at every step a NAND-TM program has access to all the scalar variables, but can only access the arrays at the location i of the index variable. 
Control of location: in each step the machine can move the head location by at most one position. | Control of index variable: in each iteration of its main loop the program can modify the index i by at most one. 


7.2.3 Examples 

We now present some examples of NAND-TM programs. 


Example 7.9 — Increment in NAND-TM. The following is a NAND-TM 
program to compute the increment function. That is, INC : {0,1}* → 
{0,1}* such that for every x ∈ {0,1}^n, INC(x) is the n + 1 bit long 
string y such that if X = Σ_{i=0}^{n−1} x_i · 2^i is the number represented by 
x, then y is the (least-significant digit first) binary representation of 
the number X + 1. 

We start by describing the program using "syntactic sugar" for 
NAND-CIRC for the IF, XOR and AND functions (as well as the constant 
one function, and the function COPY that just maps a bit to 
itself). 


carry = IF(started,carry,one(started)) 
started = one(started) 
Y[i] = XOR(X[i],carry) 
carry = AND(X[i],carry) 
Y_nonblank[i] = one(started) 
MODANDJUMP(X_nonblank[i],X_nonblank[i]) 


Since we used syntactic sugar, the above is not, strictly speaking, 
a valid NAND-TM program. However, by "opening up" all the 
syntactic sugar, we get the following "sugar free" valid program to 
compute the same function. 


temp_0 = NAND(started,started) 
temp_1 = NAND(started,temp_0) 
temp_2 = NAND(started,started) 
temp_3 = NAND(temp_1,temp_2) 
temp_4 = NAND(carry,started) 
carry = NAND(temp_3,temp_4) 
temp_6 = NAND(started,started) 
started = NAND(started,temp_6) 
temp_8 = NAND(X[i],carry) 
temp_9 = NAND(X[i],temp_8) 
temp_10 = NAND(carry,temp_8) 
Y[i] = NAND(temp_9,temp_10) 
temp_12 = NAND(X[i],carry) 
carry = NAND(temp_12,temp_12) 
temp_14 = NAND(started,started) 
Y_nonblank[i] = NAND(started,temp_14) 
MODANDJUMP(X_nonblank[i],X_nonblank[i]) 
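To see why the increment program works, it can help to transcribe its sugared form into ordinary Python, one statement per line of the program. This transcription is only a sketch: the function name inc and the dictionaries standing in for the arrays Y and Y_nonblank are our own choices, and the final MODANDJUMP is rendered as "advance i or stop".

```python
def inc(x):
    """Transcription of the sugared increment program; x is least-significant bit first."""
    X = [int(c) for c in x]
    n = len(X)
    Y, Y_nonblank = {}, {}
    carry = started = 0
    i = 0
    while True:
        x_i = X[i] if i < n else 0           # uninitialized cells default to 0
        x_nonblank_i = 1 if i < n else 0
        carry = carry if started else 1      # carry = IF(started,carry,one(started))
        started = 1                          # started = one(started)
        Y[i] = x_i ^ carry                   # Y[i] = XOR(X[i],carry)
        carry = x_i & carry                  # carry = AND(X[i],carry)
        Y_nonblank[i] = 1                    # Y_nonblank[i] = one(started)
        if x_nonblank_i == 1:                # MODANDJUMP(1,1): move right, repeat
            i += 1
        else:                                # MODANDJUMP(0,0): halt
            break
    out, m = [], 0
    while Y_nonblank.get(m, 0) == 1:
        out.append(str(Y[m]))
        m += 1
    return "".join(out)
```

The loop runs once for each input position and one extra time at position n, where the remaining carry is written as the top bit, so the output always has n + 1 bits.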


Example 7.10 — XOR in NAND-TM. The following is a NAND-TM program 
to compute the XOR function on inputs of arbitrary length. 
That is, XOR : {0,1}* → {0,1} such that XOR(x) = Σ_{i=0}^{|x|−1} x_i 
mod 2 for every x ∈ {0,1}*. Once again, we use a certain "syntactic 
sugar". Specifically, we access the arrays X and Y at their 
zero-th entry, while NAND-TM only allows access to arrays in the 
coordinate of the variable i. 


temp_0 = NAND(X[0],X[0]) 
Y_nonblank[0] = NAND(X[0],temp_0) 
temp_2 = NAND(X[i],Y[0]) 
temp_3 = NAND(X[i],temp_2) 
temp_4 = NAND(Y[0],temp_2) 
Y[0] = NAND(temp_3,temp_4) 
MODANDJUMP(X_nonblank[i],X_nonblank[i]) 


To transform the program above to a valid NAND-TM program, 
we can transform references such as X[0] and Y[0] to scalar variables 
x_0 and y_0 (similarly we can transform any reference of the 
form Foo[17] or Bar[15] to scalars such as foo_17 and bar_15). 
We then need to add code to load the value of X[0] to x_0 and 
similarly to write to Y[0] the value of y_0, but this is not hard to 
do. Using the fact that variables are initialized to zero by default, 
we can create a variable init which will be set to 1 at the end of 
the first iteration and not changed since then. We can then add an 
array Atzero and code that will modify Atzero[i] to 1 if init is 
0 and otherwise leave it as it is. This will ensure that Atzero[i] is 
equal to 1 if and only if i is set to zero, and allow the program to 
know when we are at the zeroth location. Thus we can add code 
to read and write to the corresponding scalars x_0, y_0 when we 
are at the zeroth location, and also code to move i to zero and then 
halt at the end. Working this out fully is somewhat tedious, but can 
be a good exercise. 
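The init/Atzero idea can be tried out in Python before writing the NAND-TM code. The sketch below is one possible arrangement and not the book's worked solution: the scalar going_back is a hypothetical helper we introduce to separate the left-to-right pass from the return to location zero, and ordinary Python control flow stands in for the NAND lines and the final MODANDJUMP.

```python
def xor_sketch(x):
    """Compute the parity of x while reading arrays only at the index i."""
    X = [int(c) for c in x]
    n = len(X)
    atzero = {}        # array Atzero, cells default to 0
    init = 0           # becomes 1 after the first iteration
    y_0 = 0            # scalar standing in for Y[0]
    going_back = 0     # hypothetical scalar: 1 once the end of the input was seen
    i = 0
    while True:
        # Mark the current cell only during the very first iteration,
        # so that Atzero[i] == 1 exactly when i == 0.
        atzero[i] = atzero.get(i, 0) | (1 - init)
        init = 1
        x_i = X[i] if i < n else 0
        x_nonblank_i = 1 if i < n else 0
        if not going_back:
            y_0 ^= x_i                 # accumulate the parity in a scalar
            if x_nonblank_i:
                i += 1                 # MODANDJUMP(1,1): keep scanning right
            else:
                going_back = 1         # MODANDJUMP(1,0): stay, switch phase
        elif atzero.get(i, 0):
            return str(y_0)            # at location zero: write Y[0] and halt
        else:
            i = max(i - 1, 0)          # MODANDJUMP(0,1): walk back to zero
```

The point of the sketch is that the program never compares i with 0 directly; it only inspects the bit Atzero[i] at its current location, which is all a NAND-TM program is allowed to do.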


7.3 EQUIVALENCE OF TURING MACHINES AND NAND-TM PROGRAMS 


Given the above discussion, it might not be surprising that Turing 
machines turn out to be equivalent to NAND-TM programs. Indeed, 
we designed the NAND-TM language to have this property. Never- 
theless, this is a significant result, and the first of many other such 
equivalence results we will see in this book. 


Theorem 7.11 — Turing machines and NAND-TM programs are equivalent. For 
every F : {0,1}* → {0,1}*, F is computable by a NAND-TM program 
P if and only if there is a Turing machine M that computes F. 


Proof Idea: 

To prove such an equivalence theorem, we need to show two directions. 
We need to be able to (1) transform a Turing machine M to 
a NAND-TM program P that computes the same function as M and 
(2) transform a NAND-TM program P into a Turing machine M that 
computes the same function as P. 

The idea of the proof is illustrated in Fig. 7.9. To show (1), given 
a Turing machine M, we will create a NAND-TM program P that 
will have an array Tape for the tape of M and scalar (i.e., non-array) 
variable(s) state for the state of M. Specifically, since the state of a 
Turing machine is not in {0,1} but rather in a larger set [k], we will use 
⌈log k⌉ variables state_0, …, state_{⌈log k⌉−1} to store the 
representation of the state. Similarly, to encode the larger alphabet Σ 
of the tape, we will use ⌈log |Σ|⌉ arrays Tape_0, …, Tape_{⌈log |Σ|⌉−1}, 
such that the i-th location of these arrays encodes the i-th symbol in the 
tape for every i. Using the fact that every function can be computed 
by a NAND-CIRC program, we will be able to compute the transition 
function of M, replacing moving left and right by decrementing and 
incrementing i respectively. 

We show (2) using very similar ideas. Given a program P that uses 
a array variables and b scalar variables, we will create a Turing machine 
with about 2^b states to encode the values of scalar variables, and 
an alphabet of about 2^a so we can encode the arrays using our tape. 
(The reason the sizes are only "about" 2^a and 2^b is that we need to 
add some symbols and steps for bookkeeping purposes.) The Turing 
machine M simulates each iteration of the program P by updating its 
state and tape accordingly. 



Proof of Theorem 7.11. We start by proving the "if" direction of 
Theorem 7.11. Namely we show that given a Turing machine M, we can 
find a NAND-TM program P_M such that for every input x, if M halts 
on input x with output y then P_M(x) = y. Since our goal is just to 
show such a program P_M exists, we don't need to write out the full 
code of P_M, line by line, and can take advantage of our various 
"syntactic sugar" in describing it. 

The key observation is that by Theorem 4.12 we can compute every 
finite function using a NAND-CIRC program. In particular, consider 
the transition function δ_M : [k] × Σ → [k] × Σ × {L, R, S, H} of our Turing 
machine. We can encode its components as follows: 


• We encode [k] using {0,1}^ℓ and Σ using {0,1}^{ℓ′}, where ℓ = ⌈log k⌉ 
and ℓ′ = ⌈log |Σ|⌉. 

• We encode the set {L, R, S, H} using {0,1}². We will choose the 
encoding L ↦ 01, R ↦ 11, S ↦ 10, H ↦ 00. (This conveniently 
corresponds to the semantics of the MODANDJUMP operation.) 


Hence we can identify δ_M with a function M : {0,1}^{ℓ+ℓ′} → 
{0,1}^{ℓ+ℓ′+2}, mapping strings of length ℓ + ℓ′ to strings of length 
ℓ + ℓ′ + 2. By Theorem 4.12 there exists a finite length NAND-CIRC 
program ComputeM that computes this function M. The idea behind 
the NAND-TM program to simulate M is to: 

1. Use variables state_0 … state_{ℓ−1} to encode M's state. 

2. Use arrays Tape_0[] … Tape_{ℓ′−1}[] to encode M's tape. 

3. Use the fact that transition is finite and computable by NAND-CIRC 
   program. 

Figure 7.9: Comparing a Turing machine to a NAND-TM 
program. Both have an unbounded memory 
component (the tape for a Turing machine, and the arrays 
for a NAND-TM program), as well as a constant 
local memory (state for a Turing machine, and scalar 
variables for a NAND-TM program). Both can only 
access at each step one location of the unbounded 
memory, this is the "head" location for a Turing 
machine, and the value of the index variable i for a 
NAND-TM program. 


Given the above, we can write code of the form: 

state_0 … state_{ℓ−1}, Tape_0[i] … Tape_{ℓ′−1}[i], dir0, dir1 ← 
    TRANSITION(state_0 … state_{ℓ−1}, Tape_0[i] … Tape_{ℓ′−1}[i], 
    dir0, dir1) 
MODANDJUMP(dir0,dir1) 

Every step of the main loop of the above program perfectly mimics 
the computation of the Turing machine M, and so the program carries 
out exactly the definition of computation by a Turing machine as per 
Definition 7.1. 
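The construction above can be imitated in Python to check that the encoding of directions really lines up with MODANDJUMP. The sketch below rests on several assumptions of ours: the name simulate_with_bits, the lookup table standing in for the NAND-CIRC program that computes the transition function, and the choice to start with the input already written on the bit-encoded tape (the actual program would first have to copy it from X).

```python
DIR_BITS = {"L": (0, 1), "R": (1, 1), "S": (1, 0), "H": (0, 0)}


def width(m):
    """ceil(log2(m)) for m >= 1."""
    return (m - 1).bit_length()


def to_bits(v, w):
    return tuple((v >> j) & 1 for j in range(w))


def from_bits(bits):
    return sum(b << j for j, b in enumerate(bits))


def simulate_with_bits(delta, k, alphabet, x, max_iters=10_000):
    """Run the machine delta using only bit tuples for state and tape cells."""
    ell, ell2 = width(k), width(len(alphabet))
    code = {sym: n for n, sym in enumerate(alphabet)}
    # Finite table on bit strings; a NAND-CIRC program could compute it.
    table = {
        to_bits(s, ell) + to_bits(code[sym], ell2):
            to_bits(s2, ell) + to_bits(code[sym2], ell2) + DIR_BITS[d]
        for (s, sym), (s2, sym2, d) in delta.items()
    }
    blank = to_bits(code["∅"], ell2)
    tape = [to_bits(code[c], ell2) for c in ["▷"] + list(x)]
    state, i = to_bits(0, ell), 0
    for _ in range(max_iters):
        while i >= len(tape):
            tape.append(blank)
        out = table[state + tape[i]]
        state, tape[i] = out[:ell], out[ell:ell + ell2]
        dir0, dir1 = out[-2:]
        if (dir0, dir1) == (0, 0):               # MODANDJUMP(0,0): halt
            result = []
            for cell in tape:
                sym = alphabet[from_bits(cell)]
                if sym == "∅":
                    break
                if sym in ("0", "1"):
                    result.append(sym)
            return "".join(result)
        if dir1 == 1:                            # 11 moves right, 01 moves left
            i = i + 1 if dir0 == 1 else max(i - 1, 0)
    return None
```

Listing ∅ first in the alphabet gives it the all-zero code, which matches the NAND-TM convention that untouched array cells hold 0.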

For the other direction, suppose that P is a NAND-TM program 
with s lines, ℓ scalar variables, and ℓ′ array variables. We will show 
that there exists a Turing machine M_P with 2^ℓ + C states and alphabet 
Σ of size C′ + 2^{ℓ′} that computes the same function as P (where C, C′ 
are some constants to be determined later). 

Specifically, consider the function P : {0,1}^ℓ × {0,1}^{ℓ′} → {0,1}^ℓ × 
{0,1}^{ℓ′} that on input the contents of P's scalar variables and the 
contents of the array variables at location i in the beginning of an 
iteration, outputs all the new values of these variables at the last line of the 
iteration, right before the MODANDJUMP instruction is executed. 

If foo and bar are the two variables that are used as input to the 
MODANDJUMP instruction, then based on the values of these variables we 
can compute whether i will increase, decrease or stay the same, and 
whether the program will halt or jump back to the beginning. Hence a 
Turing machine can simulate an execution of P in one iteration using 
a finite function applied to its alphabet. The overall operation of the 
Turing machine will be as follows: 


1. The machine M_P encodes the contents of the array variables of P 
   in its tape and the contents of the scalar variables in (part of) its 
   state. Specifically, if P has ℓ scalar variables and ℓ′ arrays, then the 
   state space of M_P will be large enough to encode all 2^ℓ assignments 
   to the scalar variables, and the alphabet Σ of M_P will be large enough 
   to encode all 2^{ℓ′} assignments for the array variables at each location. 
   The head location corresponds to the index variable i. 

2. Recall that every line of the program P corresponds to reading 
   and writing either a scalar variable, or an array variable at the 
   location i. In one iteration of P the value of i remains fixed, and so the 
   machine M_P can simulate this iteration by reading the values of all 
   array variables at i (which are encoded by the single symbol in the 
   alphabet Σ located at the i-th cell of the tape), reading the values 
   of all scalar variables (which are encoded by the state), and updating 
   both. The transition function of M_P can output L, S, R depending 
   on whether the values given to the MODANDJUMP operation are 01, 10 
   or 11 respectively. 

3. When the program halts (i.e., MODANDJUMP gets 00) then the Turing 
   machine will enter into a special loop to copy the results of the Y 
   array into the output and then halt. We can achieve this by adding a 
   few more states. 


The above is not a full formal description of a Turing machine, but 
our goal is just to show that such a machine exists. One can see that 
M_P simulates every step of P, and hence computes the same function 
as P. 
∎ 


7.3.1 Specification vs implementation (again) 


Once you understand the definitions of both NAND-TM programs 
and Turing machines, Theorem 7.11 is straightforward. Indeed, 
NAND-TM programs are not as much a different model from Tur- 
ing machines as they are simply a reformulation of the same model 
using programming language notation. You can think of the differ- 
ence between a Turing machine and a NAND-TM program as the 
difference between representing a number using decimal or binary 
notation. In contrast, the difference between a function F and a Turing 
machine that computes F is much more profound: it is like the difference 
between the equation x² + x = 12, and the number 3 that is a 
solution for this equation. For this reason, while we take special care 
in distinguishing functions from programs or machines, we will often 
identify the two latter concepts. We will move freely between describ- 
ing an algorithm as a Turing machine or as a NAND-TM program (as 
well as some of the other equivalent computational models we will see 
in Chapter 8 and beyond). 
Table 7.3: Specification vs Implementation formalisms 

Setting | Specification | Implementation 
Finite computation | Functions mapping {0,1}^n to {0,1}^m | Circuits, Straightline programs 
Infinite computation | Functions mapping {0,1}* to {0,1} or to {0,1}* | Algorithms, Turing Machines, Programs 


7.4 NAND-TM SYNTACTIC SUGAR 


Just like we did with NAND-CIRC in Chapter 4, we can use "syntactic 
sugar" to make NAND-TM programs easier to write. For starters, we 
can use all of the syntactic sugar of NAND-CIRC, such as macro 
definitions and conditionals (i.e., if/then). However, we can go beyond 
this and achieve (for example): 

• Inner loops such as the while and for operations common to many 
programming languages. 

• Multiple index variables (e.g., not just i but we can add j, k, etc.). 

• Arrays with more than one dimension (e.g., Foo[i][j], 
Bar[i][j][k] etc.) 

In all of these cases (and many others) we can implement the new 
feature as mere "syntactic sugar" on top of standard NAND-TM. This 
means that the set of functions computable by NAND-TM with this 
feature is the same as the set of functions computable by standard 
NAND-TM. Similarly, we can show that the set of functions computable 
by Turing machines that have more than one tape, or tapes 
of more dimensions than one, is the same as the set of functions 
computable by standard Turing machines. 


7.4.1 “GOTO” and inner loops 

We can implement more advanced looping constructs than the simple 
MODANDJUMP. For example, we can implement GOTO. A GOTO statement 
corresponds to jumping to a specific line in the execution. For example, 
if we have code of the form 

"start": do foo 
    GOTO("end") 
"skip": do bar 
"end": do blah 
then the program will only do foo and blah as when it reaches the 
line GOTO("end") it will jump to the line labeled with "end". We can 
achieve the effect of GOTO in NAND-TM using conditionals. In the 
code below, we assume that we have a variable pc that can take strings 
of some constant length. This can be encoded using a finite number 
of Boolean variables pc_0, pc_1, …, pc_{k−1}, and so when we write 
below pc = "label" what we mean is something like pc_0 = 0, pc_1 
= 1, … (where the bits 0, 1, … correspond to the encoding of the finite 
string "label" as a string of length k). We also assume that we have 
access to conditionals (i.e., if statements), which we can emulate using 
syntactic sugar in the same way as we did in NAND-CIRC. 

To emulate a GOTO statement, we will first modify a program P of 
the form 


do foo 
do bar 
do blah 


to have the following form (using syntactic sugar for if): 


pc = "line1"
if (pc=="line1"):
    do foo
    pc = "line2"
if (pc=="line2"):
    do bar
    pc = "line3"
if (pc=="line3"):
    do blah


These two programs do the same thing. The variable pc cor- 
responds to the “program counter” and tells the program which 
line to execute next. We can see that if we wanted to emulate a
GOTO("line3") then we could simply modify the instruction pc =
"line2" to be pc = "line3".

In NAND-CIRC we could only have GOTOs that go forward in the 
code, but since in NAND-TM everything is encompassed within a 
large outer loop, we can use the same ideas to implement GOTOs that 
can go backward, as well as conditional loops. 
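The program-counter idea can be tried out in ordinary Python. The following is only a minimal sketch, not NAND-TM code: the function name `run`, the parameter `goto_target`, and the use of a list to record which "lines" ran are all assumptions made for illustration.

```python
def run(goto_target="line2"):
    """Emulate 'do foo; do bar; do blah' with a program counter pc.

    goto_target is the label stored in pc after line1. Passing "line3"
    emulates placing GOTO("line3") right after 'do foo'.
    """
    trace = []
    pc = "line1"
    if pc == "line1":
        trace.append("foo")   # do foo
        pc = goto_target
    if pc == "line2":
        trace.append("bar")   # do bar
        pc = "line3"
    if pc == "line3":
        trace.append("blah")  # do blah
    return trace


print(run())         # all three lines run in order
print(run("line3"))  # the jump skips 'do bar'
```

Because the conditionals appear in program order, this sketch can only jump forward, which mirrors the NAND-CIRC limitation mentioned above; backward jumps need the surrounding outer loop.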


Other loops. Once we have GOTO, we can emulate all the standard loop 
constructs such as while, do .. until or for in NAND-TM as well.
For example, we can replace the code 


while foo: 
do blah 
do bar 


with


"loop":
    if NOT(foo): GOTO("next")
    do blah
    GOTO("loop")
"next":
    do bar
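As a rough illustration of the translation above, the Python sketch below emulates the labels with a program counter inside one outer loop, which plays the role of the loop that wraps every NAND-TM program. The concrete condition `counter < n` standing in for `foo`, the extra `"body"` and `"halt"` labels, and the function name are hypothetical choices, not part of the text.

```python
def while_via_goto(n):
    """Emulate 'while counter < n: do blah' followed by 'do bar' using labels."""
    trace = []
    counter = 0
    pc = "loop"
    while pc != "halt":              # stands in for NAND-TM's outer loop
        if pc == "loop":
            # if NOT(foo): GOTO("next"), with foo taken to be counter < n
            pc = "body" if counter < n else "next"
        elif pc == "body":
            trace.append("blah")     # do blah
            counter += 1
            pc = "loop"              # GOTO("loop"): a backward jump
        elif pc == "next":
            trace.append("bar")      # do bar
            pc = "halt"
    return trace


print(while_via_goto(3))
```

The backward jump works only because control returns to the top of the outer loop, where the test on `pc` is evaluated again.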




Remark 7.13 — GOTO’s in programming languages. The
GOTO statement was a staple of most early programming
languages, but has largely fallen out of favor and
is not included in many modern languages such as
Python, Java, and JavaScript. In 1968, Edsger Dijkstra wrote a
famous letter titled “Go to statement considered harmful.”
(see also Fig. 7.10). The main trouble with GOTO
is that it makes analysis of programs more difficult
by making it harder to argue about invariants of the
program.

When a program contains a loop of the form: 


for j in range(100): 
do something 


do blah 


you know that the line of code do blah can only be
reached if the loop ended, in which case you know
that the loop body ran for every j from 0 to 99, and might also be able to argue
other properties of the state of the program. In con- 
trast, if the program might jump to do blah from any 
other point in the code, then it’s very hard for you as 
the programmer to know what you can rely upon in 
this code. As Dijkstra said, such invariants are important
because “our intellectual powers are rather
geared to master static relations and .. our powers
to visualize processes evolving in time are relatively
poorly developed” and so “we should ... do ... our
utmost best to shorten the conceptual gap between the static
program and the dynamic process.”

That said, GOTO is still a major part of lower level lan- 
guages where it is used to implement higher-level 
looping constructs such as while and for loops. 

For example, even though Java doesn’t have a GOTO 
statement, the Java Bytecode (which is a lower-level 
representation of Java) does have such a statement. 
Similarly, Python bytecode has instructions such as 
POP_JUMP_IF_TRUE that implement the GOTO function- 
ality, and similar instructions are included in many 
assembly languages. The way we use GOTO to implement
a higher-level functionality in NAND-TM is
reminiscent of the way these jump instructions are used
to implement higher-level looping constructs.

Figure 7.10: XKCD’s take on the GOTO statement.


7.5 UNIFORMITY, AND NAND VS NAND-TM (DISCUSSION)
While NAND-TM adds extra operations over NAND-CIRC, it is not
exactly accurate to say that NAND-TM programs or Turing machines
are “more powerful” than NAND-CIRC programs or Boolean circuits.
NAND-CIRC programs, having no loops, are simply not applicable 
for computing functions with an unbounded number of inputs. Thus, 
to compute a function F : {0,1}* → {0,1}* using NAND-CIRC (or
equivalently, Boolean circuits) we need a collection of programs/circuits:
one for every input length.

The key difference between NAND-CIRC and NAND-TM is that
NAND-TM allows us to express the fact that the algorithm for
computing parities of length-100 strings is really the same one as the
algorithm for computing parities of length-5 strings (or similarly the
fact that the algorithm for adding n-bit numbers is the same for every
n, etc.). That is, one can think of the NAND-TM program for general
parity as the “seed” out of which we can grow NAND-CIRC programs
for length 10, length 100, or length 1000 parities as needed.
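The "seed" picture can be sketched in Python. Below, `parity_uniform` is a single procedure that works for every input length, while `parity_circuit(n)` grows a separate straight-line list of NAND gates for each fixed length n. The gate representation (triples of wire names) and all function names are assumptions made for this illustration; the four-NAND construction of XOR is the standard one.

```python
from itertools import product


def parity_uniform(x):
    """One algorithm for all input lengths (the uniform view)."""
    result = 0
    for bit in x:
        result ^= bit
    return result


def parity_circuit(n):
    """Grow a straight-line NAND program for parity on exactly n >= 1 bits.

    Returns (gates, output_wire); each gate (out, a, b) means out = NAND(a, b).
    """
    gates = []
    acc = "X[0]"
    for i in range(1, n):
        a, b = acc, f"X[{i}]"
        u, v, w, s = f"u{i}", f"v{i}", f"w{i}", f"s{i}"
        # XOR(a, b) built from four NAND gates
        gates += [(u, a, b), (v, a, u), (w, b, u), (s, v, w)]
        acc = s
    return gates, acc


def run_circuit(gates, output_wire, x):
    """Evaluate a gate list produced by parity_circuit on the bits x."""
    wires = {f"X[{i}]": bit for i, bit in enumerate(x)}
    for out, a, b in gates:
        wires[out] = 1 - (wires[a] & wires[b])
    return wires[output_wire]


# The circuit for length 5 agrees with the uniform algorithm on all 32 inputs.
gates5, out5 = parity_circuit(5)
print(all(run_circuit(gates5, out5, x) == parity_uniform(x)
          for x in product([0, 1], repeat=5)))
```

Note that the circuit for length n has 4(n − 1) gates, so its size grows with n, whereas the text of `parity_uniform` stays the same.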

This notion of a single algorithm that can compute functions of all 
input lengths is known as uniformity of computation. Hence we think 
of Turing machines / NAND-TM as uniform models of computation, 
as opposed to Boolean circuits or NAND-CIRC, which are non-uniform 
models, in which we have to specify a different program for every 
input length. 

Looking ahead, we will see that this uniformity leads to another 
crucial difference between Turing machines and circuits. Turing ma- 
chines can have inputs and outputs that are longer than the descrip- 
tion of the machine as a string, and in particular there exists a Turing 
machine that can “self replicate” in the sense that it can print its own 
code. The notion of “self replication”, and the related notion of “self 
reference” are crucial to many aspects of computation, and beyond 
that to life itself, whether in the form of digital or biological programs. 

For now, what you ought to remember are the following differences
between uniform and non-uniform computational models: 


• Non-uniform computational models: Examples are NAND-CIRC
programs and Boolean circuits. These are models where each
individual program/circuit can compute a finite function f : {0,1}^n →
{0,1}^m. We have seen that every finite function can be computed by
some program/circuit. To discuss computation of an infinite
function F : {0,1}* → {0,1}* we need to allow a sequence {P_n}_{n∈N} of
programs/circuits (one for every input length), but this does not
capture the notion of a single algorithm to compute the function F.


• Uniform computational models: Examples are Turing machines and
NAND-TM programs. These are models where a single program/machine
can take inputs of arbitrary length and hence compute an
infinite function F : {0,1}* → {0,1}*. The number of steps that
a program/machine takes on some input is not bounded
in advance and in particular there is a chance that it will enter into
an infinite loop. Unlike the non-uniform case, we have not shown
that every infinite function can be computed by some NAND-TM
program/Turing machine. We will come back to this point in
Chapter 9.


7.6 EXERCISES 


Exercise 7.1 — Explicit NAND TM programming. Produce the code of a
(syntactic-sugar free) NAND-TM program P that computes the
(unbounded input length) Majority function Maj : {0,1}* → {0,1} where
for every x ∈ {0,1}*, Maj(x) = 1 if and only if ∑_{i=0}^{|x|-1} x_i > |x|/2. We
say “produce” rather than “write” because you do not have to write
the code of P by hand, but rather can use the programming language
of your choice to compute this code.


Exercise 7.2 — Computable functions examples. Prove that the following 
functions are computable. For all of these functions, you do not have 
to fully specify the Turing machine or the NAND-TM program that
computes the function, but rather only prove that such a machine or 
program exists: 


1. INC : {0,1}* → {0,1}* which takes as input a representation of a
natural number n and outputs the representation of n + 1.


2. ADD : {0,1}* → {0,1}* which takes as input a representation of
a pair of natural numbers (n, m) and outputs the representation of
n + m.


3. MULT : {0,1}* → {0,1}*, which takes a representation of a pair of
natural numbers (n, m) and outputs the representation of n · m.


4. SORT : {0,1}* → {0,1}* which takes as input the representation of
a list of natural numbers (a_0, ..., a_{n-1}) and returns its sorted version
(b_0, ..., b_{n-1}) such that for every i ∈ [n] there is some j ∈ [n] with
b_i = a_j and b_0 ≤ b_1 ≤ ⋯ ≤ b_{n-1}.


Exercise 7.3 — Two index NAND-TM. Define NAND-TM′ to be the variant
of NAND-TM where there are two index variables i and j. Arrays
can be indexed by either i or j. The operation MODANDJUMP takes four
variables a, b, c, d and uses the values of c, d to decide whether to
increment j, decrement j or keep it in the same value
(corresponding to 01, 10, and 00 respectively). Prove that for every function
F : {0,1}* → {0,1}*, F is computable by a NAND-TM program if
and only if F is computable by a NAND-TM′ program.


Exercise 7.4 — Two tape Turing machines. Define a two tape Turing machine 
to be a Turing machine which has two separate tapes and two separate 
heads. At every step, the transition function gets as input the location 
of the cells in the two tapes, and can decide whether to move each 
head independently. Prove that for every function F : {0,1}* →
{0,1}*, F is computable by a standard Turing machine if and only if F
is computable by a two-tape Turing machine.


Exercise 7.5 — Two dimensional arrays. Define NAND-TM′′ to be the
variant of NAND-TM where just like NAND-TM′ defined in Exercise 7.3
there are two index variables i and j, but now the arrays are two
dimensional and so we index an array Foo by Foo[i][j]. Prove that for
every function F : {0,1}* → {0,1}*, F is computable by a NAND-TM
program if and only if F is computable by a NAND-TM′′ program.


Exercise 7.6 — Two dimensional Turing machines. Define a two-dimensional 
Turing machine to be a Turing machine in which the tape is two dimen- 
sional. At every step the machine can move Up, Down, Left, Right, or 
Stay. Prove that for every function F : {0,1}* → {0,1}*, F is
computable by a standard Turing machine if and only if F is computable
by a two-dimensional Turing machine.


Exercise 7.7 Prove the following closure properties of the set R defined
in Definition 7.3:


1. If F ∈ R then the function G(x) = 1 − F(x) is in R.
2. If F, G ∈ R then the function H(x) = F(x) ∨ G(x) is in R.


3. If F ∈ R then the function F* is in R where F* is defined as
follows: F*(x) = 1 iff there exist some strings w_0, ..., w_{k-1} such that
x = w_0 w_1 ⋯ w_{k-1} and F(w_i) = 1 for every i ∈ [k].


4. If F ∈ R then the function

G(x) = 1 if there exists y ∈ {0,1}^{|x|} such that F(xy) = 1,
G(x) = 0 otherwise

is in R.


Exercise 7.8 — Oblivious Turing Machines (challenging). Define a Turing
machine M to be oblivious if its head movements are independent of its
input. That is, we say that M is oblivious if there exists an infinite
sequence MOVE ∈ {L,R,S}^∞ such that for every x ∈ {0,1}*, the
movements of M when given input x (up until the point it halts, if
such point exists) are given by MOVE_0, MOVE_1, MOVE_2, ....

Prove that for every function F : {0,1}* → {0,1}*, if F is
computable then it is computable by an oblivious Turing machine. See
footnote for hint. [3]

[3] You can use the sequence R, L, R, R, L, L, R, R, R, L, L, L, ...


Exercise 7.9 — Single vs multiple bit. Prove that for every F : {0,1}* →
{0,1}*, the function F is computable if and only if the following
function G : {0,1}* → {0,1} is computable, where G is defined as follows:

G(x, i, σ) = F(x)_i   if i < |F(x)| and σ = 0
G(x, i, σ) = 1        if i < |F(x)| and σ = 1
G(x, i, σ) = 0        if i ≥ |F(x)|


Exercise 7.10 — Uncomputability via counting. Recall that R is the set of all 
total functions from {0,1}* to {0,1} that are computable by a Turing 
machine (see Definition 7.3). Prove that R is countable. That is, prove 
that there exists a one-to-one map DiN : R → N. You can use the
equivalence between Turing machines and NAND-TM programs.


Exercise 7.11 — Not every function is computable. Prove that the set of all 
total functions from {0,1}* → {0,1} is not countable. You can use the
results of Section 2.4. (We will see an explicit uncomputable function
in Chapter 9.)


7.7 BIBLIOGRAPHICAL NOTES


Augusta Ada Byron, countess of Lovelace (1815-1852) lived a short
but turbulent life, though she is today best known for her
collaboration with Charles Babbage (see [Ste87] for a biography). Ada took
an immense interest in Babbage’s analytical engine, which we men- 
tioned in Chapter 3. In 1842-3, she translated from Italian a paper of 
Menabrea on the engine, adding copious notes (longer than the paper 
itself). The quote in the chapter’s beginning is taken from Nota A in 
this text. Lovelace’s notes contain several examples of programs for the 
analytical engine, and because of this she has been called “the world’s 
first computer programmer” though it is not clear whether they were 
written by Lovelace or Babbage himself [Hol01]. Regardless, Ada was
clearly one of very few people (perhaps the only one outside of Bab- 
bage himself) to fully appreciate how important and revolutionary the 
idea of mechanizing computation truly is. 

The books of Shetterly [She16] and Sobel [Sob17] discuss the his- 
tory of human computers (who were female, more often than not) 
and their important contributions to scientific discoveries in astron- 
omy and space exploration. 

Alan Turing was one of the intellectual giants of the 20th century. 
He was not only the first person to define the notion of computation, 
but also invented and used some of the world’s earliest computational 
devices as part of the effort to break the Enigma cipher during World 
War II, saving millions of lives. Tragically, Turing committed suicide 
in 1954, following his conviction in 1952 for homosexual acts and a 
court-mandated hormonal treatment. In 2009, British prime minister 
Gordon Brown made an official public apology to Turing, and in 2013 
Queen Elizabeth II granted Turing a posthumous pardon. Turing’s life 
is the subject of a great book and a mediocre movie. 

Sipser’s text [Sip97] defines a Turing machine as a seven tuple con- 
sisting of the state space, input alphabet, tape alphabet, transition 
function, starting state, accepting state, and rejecting state. Superfi- 
cially this looks like a very different definition than Definition 7.1 but 
it is simply a different representation of the same concept, just as a 
graph can be represented in either adjacency list or adjacency matrix 
form. 

One difference is that Sipser considers a general set of states Q that 
is not necessarily of the form Q = {0,1, 2, ...,k — 1} for some natural 
number k > 0. Sipser also restricts his attention to Turing machines
that output only a single bit, and therefore designates two special
halting states: the “0 halting state” (often known as the rejecting state)
and the “1 halting state” (often known as the accepting state).
Thus instead of writing 0 or 1 on an output tape, the machine will
enter into one of these states and halt. This again makes no difference
to the computational power, though we prefer to consider the more
general model of multi-bit outputs. (Sipser presents the basic task of a 
Turing machine as that of deciding a language as opposed to computing 
a function, but these are equivalent, see Remark 7.4.) 

Sipser considers also functions with input in Σ* for an arbitrary
alphabet Σ (and hence distinguishes between the input alphabet which
he denotes as Σ and the tape alphabet which he denotes as Γ), while we
restrict attention to functions with binary strings as input. Again this
is not a major issue, since we can always encode an element of Σ using
a binary string of length ⌈log |Σ|⌉. Finally (and this is a very minor
point) Sipser requires the machine to either move left or right in every 
step, without the Stay operation, though staying in place is very easy 
to emulate by simply moving right and then back left. 

Another definition used in the literature is that a Turing machine
M recognizes a language L if for every x ∈ L, M(x) = 1 and for
every x ∉ L, M(x) ∈ {0, ⊥} (that is, M either outputs 0 or does not
halt). A language L is recursively enumerable if
there exists a Turing machine M that recognizes it, and the set of all 
recursively enumerable languages is often denoted by RE. We will not 
use this terminology in this book. 

One of the first programming-language formulations of Turing 
machines was given by Wang [Wan57]. Our formulation of NAND- 
TM is aimed at making the connection with circuits more direct, with 
the eventual goal of using it for the Cook-Levin Theorem, as well as
results such as P ⊆ P/poly and BPP ⊆ P/poly. The website esolangs.org
features a large variety of esoteric Turing-complete programming
languages. One of the most famous of them is Brainf*ck.


Learning Objectives:


• Learn about RAM machines and the λ calculus.


• Equivalence between these and other models
and Turing machines.


• Cellular automata and configurations of
Turing machines.


• Understand the Church-Turing thesis.


8 
Equivalent models of computation 


“All problems in computer science can be solved by another level of indirec- 
tion”, attributed to David Wheeler. 


“Because we shall later compute with expressions for functions, we need a 
distinction between functions and forms and a notation for expressing this 
distinction. This distinction and a notation for describing it, from which we 
deviate trivially, is given by Church.”, John McCarthy, 1960 (in paper
describing the LISP programming language) 


So far we have defined the notion of computing a function using 
Turing machines, which are not a close match to the way computation 
is done in practice. In this chapter we justify this choice by showing 
that the definition of computable functions will remain the same 
under a wide variety of computational models. This notion is known 
as Turing completeness or Turing equivalence and is one of the most 
fundamental facts of computer science. In fact, a widely believed 
claim known as the Church-Turing Thesis holds that every “reasonable” 
definition of computable function is equivalent to being computable 
by a Turing machine. We discuss the Church-Turing Thesis and the 
potential definitions of “reasonable” in Section 8.8. 

Some of the main computational models we discuss in this chapter 
include: 


• RAM Machines: Turing machines do not correspond to standard
computing architectures that have Random Access Memory (RAM). 
The mathematical model of RAM machines is much closer to actual 
computers, but we will see that it is equivalent in power to Turing 
machines. We also discuss a programming language variant of 
RAM machines, which we call NAND-RAM. The equivalence of 
Turing machines and RAM machines enables demonstrating the 
Turing Equivalence of many popular programming languages, in- 
cluding all general-purpose languages used in practice such as C, 
Python, JavaScript, etc. 


Compiled on 12.19.2022 22:58
• Cellular Automata: Many natural and artificial systems can be
modeled as collections of simple components, each evolving
according to simple rules based on its state and the state of its
immediate neighbors. One well-known such example is Conway’s Game
of Life. To prove that cellular automata are equivalent to Turing
machines we introduce the tool of configurations of Turing machines.
These have other applications, and in particular are used in
Chapter 11 to prove Gödel’s Incompleteness Theorem: a central result in
mathematics.


• λ calculus: The λ calculus is a model for expressing computation
that originates from the 1930’s, though it is closely connected to
functional programming languages widely used today. Showing
the equivalence of λ calculus to Turing machines involves a
beautiful technique to eliminate recursion known as the “Y Combinator”.


Figure 8.1: Some Turing-equivalent models. All of
these are equivalent in power to Turing machines
(or equivalently NAND-TM programs) in the sense
that they can compute exactly the same class of
functions. All of these are models for computing
infinite functions that take inputs of unbounded
length. In contrast, Boolean circuits / NAND-CIRC
programs can only compute finite functions and hence
are not Turing complete.


8.1 RAM MACHINES AND NAND-RAM 


One of the limitations of Turing machines (and NAND-TM programs) 
is that we can only access one location of our arrays/tape at a time. If 
the head is at position 22 in the tape and we want to access the 957-th
position then it will take us at least 935 steps to get there. In contrast,
almost every programming language has a formalism for directly 
accessing memory locations. Actual physical computers also provide 
so called Random Access Memory (RAM) which can be thought of as a 
large array Memory, such that given an index p (i.e. memory address, 
or a pointer), we can read from and write to the p-th location of Memory.
(“Random access memory” is quite a misnomer since it has nothing to 
do with probability, but since it is a standard term in both the theory 
and practice of computing, we will use it as well.) 

The computational model that models access to such a memory is 
the RAM machine (sometimes also known as the Word RAM model), 
as depicted in Fig. 8.2. The memory of a RAM machine is an array 
of unbounded size where each cell can store a single word, which
we think of as a string in {0,1}^w and also (equivalently) as a
number in [2^w]. For example, many modern computing architectures
use 64 bit words, in which every memory location holds a string in
{0,1}^64 which can also be thought of as a number between 0 and
2^64 − 1 = 18,446,744,073,709,551,615. The parameter w is known
as the word size. In practice often w is a fixed number such as 64, but
when doing theory we model w as a parameter that can depend on
the input length or number of steps. (You can think of 2^w as roughly
corresponding to the largest memory address that we use in the
computation.) In addition to the memory array, a RAM machine also
contains a constant number of registers r_0, ..., r_{k-1}, each of which can
also contain a single word.


Figure 8.2: A RAM Machine contains a finite number of
local registers, each of which holds an integer, and an
unbounded memory array. It can perform arithmetic
operations on its register as well as load to a register r
the contents of the memory at the address indexed by
the number in register r′.


The operations a RAM machine can carry out include:

• Data movement: Load data from a certain cell in memory into
a register or store the contents of a register into a certain cell of
memory. A RAM machine can directly access any cell of memory
without having to move the “head” (as Turing machines do) to that
location. That is, in one step a RAM machine can load into register
r_i the contents of the memory cell indexed by register r_j, or store
into the memory cell indexed by register r_j the contents of register
r_i.


• Computation: RAM machines can carry out computation on
registers such as arithmetic operations, logical operations, and
comparisons.


• Control flow: As in the case of Turing machines, the choice of what
instruction to perform next can depend on the state of the RAM
machine, which is captured by the contents of its registers.
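To see the three kinds of operations working together, here is a toy interpreter in Python. It is only a sketch under stated assumptions: the instruction names (`const`, `load`, `store`, `add`, `jlt`, `jmp`), the tuple encoding, and the absence of a word-size limit are all invented for this illustration and are not a formal definition of RAM machines.

```python
from collections import defaultdict


def run_ram(program, memory_init, num_registers=5, max_steps=10_000):
    """Run a toy RAM program and return (registers, memory).

    Hypothetical instruction set:
      ("const", i, c)     r_i := c
      ("load",  i, j)     r_i := MEMORY[r_j]       (data movement)
      ("store", i, j)     MEMORY[r_j] := r_i       (data movement)
      ("add",   k, i, j)  r_k := r_i + r_j         (computation)
      ("jlt",   i, j, t)  if r_i < r_j goto t      (control flow)
      ("jmp",   t)        goto t                   (control flow)
    """
    reg = [0] * num_registers
    mem = defaultdict(int, enumerate(memory_init))  # unbounded, zero-filled
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return reg, mem
        op, *args = program[pc]
        pc += 1
        if op == "const":
            reg[args[0]] = args[1]
        elif op == "load":
            reg[args[0]] = mem[reg[args[1]]]
        elif op == "store":
            mem[reg[args[1]]] = reg[args[0]]
        elif op == "add":
            reg[args[0]] = reg[args[1]] + reg[args[2]]
        elif op == "jlt":
            if reg[args[0]] < reg[args[1]]:
                pc = args[2]
        elif op == "jmp":
            pc = args[0]
        else:
            raise ValueError(f"unknown instruction {op!r}")
    raise RuntimeError("step limit exceeded")


# MEMORY[0] holds n and MEMORY[1..n] hold numbers; the program stores their
# sum back into MEMORY[0].
SUM_PROGRAM = [
    ("const", 0, 0),     # 0: r0 := 0 (current address)
    ("load", 1, 0),      # 1: r1 := MEMORY[0] = n
    ("const", 4, 1),     # 2: r4 := 1
    ("jlt", 0, 1, 5),    # 3: if r0 < n enter the loop
    ("jmp", 9),          # 4: otherwise skip it
    ("add", 0, 0, 4),    # 5: r0 := r0 + 1
    ("load", 3, 0),      # 6: r3 := MEMORY[r0]
    ("add", 2, 2, 3),    # 7: r2 := r2 + r3
    ("jlt", 0, 1, 5),    # 8: repeat while r0 < n
    ("const", 0, 0),     # 9: r0 := 0
    ("store", 2, 0),     # 10: MEMORY[0] := r2
]

registers, memory = run_ram(SUM_PROGRAM, [3, 10, 20, 30])
print(memory[0])
```

Each `load` or `store` reaches an arbitrary address in a single step, which is exactly what a Turing machine head cannot do.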


Figure 8.3: Different aspects of RAM machines and
Turing machines. RAM machines can store integers
in their local registers, and can read and write to
their memory at a location specified by a register.
In contrast, Turing machines can only access their
memory in the head location, which moves at most
one position to the right or left in each step.
(Figure labels. RAM Machine: indirect addressing / pointers;
local registers hold O(log t) bit numbers; arithmetic operations
at unit cost. Turing Machine: each step head moves ≤ 1 step;
all operations on O(1) bits; local state is O(1) size.)


We will not give a formal definition of RAM Machines, though the
bibliographical notes section (Section 8.10) contains sources for such
definitions. Just as the NAND-TM programming language models
Turing machines, we can also define a NAND-RAM programming
language that models RAM machines. The NAND-RAM programming
language extends NAND-TM by adding the following features:


• The variables of NAND-RAM are allowed to be (non-negative)
integer valued rather than only Boolean as is the case in
NAND-TM. That is, a scalar variable foo holds a non-negative integer in N
(rather than only a bit in {0,1}), and an array variable Bar holds
an array of integers. As in the case of RAM machines, we will not
allow integers of unbounded size. Concretely, each variable holds
a number between 0 and T − 1, where T is the number of steps
that have been executed by the program so far. (You can ignore
this restriction for now: if we want to hold larger numbers, we
can simply execute dummy instructions; it will be useful in later
chapters.)


• We allow indexed access to arrays. If foo is a scalar and Bar is an
array, then Bar[foo] refers to the location of Bar indexed by the
value of foo. (Note that this means we don’t need to have a special
index variable i anymore.)


• As is often the case in programming languages, we will assume
that for Boolean operations such as NAND, a zero valued integer is
considered as false, and a non-zero valued integer is considered as
true.


• In addition to NAND, NAND-RAM also includes all the basic
arithmetic operations of addition, subtraction, multiplication, (integer)
division, as well as comparisons (equal, greater than, less than,
etc.).


• NAND-RAM includes conditional statements if/then as part of
the language.


• NAND-RAM contains looping constructs such as while and do as
part of the language.


A full description of the NAND-RAM programming language is 
in the appendix. However, the most important fact you need to know 
about NAND-RAM is that you actually don’t need to know much 
about NAND-RAM at all, since it is equivalent in power to Turing 
machines: 


Theorem 8.1 — Turing Machines (aka NAND-TM programs) and RAM
machines (aka NAND-RAM programs) are equivalent. For every function
F : {0,1}* → {0,1}*, F is computable by a NAND-TM program if
and only if F is computable by a NAND-RAM program.


Since NAND-TM programs are equivalent to Turing machines, and
NAND-RAM programs are equivalent to RAM machines, Theorem 8.1 
shows that all these four models are equivalent to one another. 


Proof Idea: 

Clearly NAND-RAM is only more powerful than NAND-TM, and 
so if a function F is computable by a NAND-TM program then it can 
be computed by a NAND-RAM program. The challenging direction is 
to transform a NAND-RAM program P to an equivalent NAND-TM 
program Q. To describe the proof in full we will need to cover the full 
formal specification of the NAND-RAM language, and show how we 
can implement every one of its features as syntactic sugar on top of 
NAND-TM. 

Figure 8.4: Overview of the steps in the proof of
Theorem 8.1 simulating NAND-RAM with NAND-TM.
We first use the inner loop syntactic sugar of
Section 7.4.1 to enable loading an integer from an array
to the index variable i of NAND-TM. Once we can do
that, we can simulate indexed access in NAND-TM. We
then use an embedding of N^2 in N to simulate two
dimensional bit arrays in NAND-TM. Finally, we use
the binary representation to encode one-dimensional
arrays of integers as two dimensional arrays of bits
hence completing the simulation of NAND-RAM with
NAND-TM.

This can be done but going over all the operations in detail is rather
tedious. Hence we will focus on describing the main ideas behind this
transformation. (See also Fig. 8.4.) NAND-RAM generalizes
NAND-TM in two main ways: (a) adding indexed access to the arrays (i.e.,
Foo[bar] syntax) and (b) moving from Boolean valued variables to
integer valued ones. The transformation has three steps:


1. Indexed access of bit arrays: We start by showing how to handle (a).
Namely, we show how we can implement in NAND-TM the
operation Setindex(Bar) such that if Bar is an array that encodes
some integer j, then after executing Setindex(Bar) the value of
i will equal to j. This will allow us to simulate syntax of the form
Foo[Bar] by Setindex(Bar) followed by Foo[i].


2. Two dimensional bit arrays: We then show how we can use “syntactic
sugar” to augment NAND-TM with two dimensional arrays. That is,
have two indices i and j and two dimensional arrays, such that we can
use the syntax Foo[i][j] to access the (i,j)-th location of Foo.


3. Arrays of integers: Finally we will encode a one dimensional array
Arr of integers by a two dimensional Arrbin of bits. The idea is
simple: if a_{i,0}, ..., a_{i,ℓ} is a binary (prefix-free) representation of
Arr[i], then Arrbin[i][j] will be equal to a_{i,j}.


Once we have arrays of integers, we can use our usual syntactic 
sugar for functions, GOTO etc. to implement the arithmetic and control 
flow operations of NAND-RAM. 
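The last step, storing an array of integers as a two dimensional array of bits, can be sketched in Python. The particular prefix-free encoding used here (write each binary digit twice, then append the marker 01) is an assumption chosen for simplicity, and the helper names `pf_encode`, `pf_decode`, `to_bit_array`, and `read_entry` are hypothetical.

```python
def pf_encode(n):
    """Prefix-free encoding of n: every binary digit doubled, then 0, 1."""
    bits = []
    for digit in bin(n)[2:]:
        bits += [int(digit), int(digit)]
    return bits + [0, 1]


def pf_decode(row):
    """Read one integer back from the start of a row of bits."""
    value = 0
    j = 0
    while (row[j], row[j + 1]) != (0, 1):   # 01 marks the end
        value = 2 * value + row[j]
        j += 2
    return value


def to_bit_array(arr):
    """Arrbin[i][j] is the j-th bit of the encoding of arr[i]."""
    return [pf_encode(n) for n in arr]


def read_entry(arrbin, i):
    """Recover Arr[i] using only bit accesses of the form arrbin[i][j]."""
    return pf_decode(arrbin[i])


arr = [0, 5, 12, 1]
arrbin = to_bit_array(arr)
print([read_entry(arrbin, i) for i in range(len(arr))])
```

Since doubled digits are always 00 or 11, the pair 01 can only appear as the end marker, so the decoder knows where each integer stops even if the rest of the row is padded with zeros.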

The above approach is not the only way to obtain a proof of
Theorem 8.1; see for example Exercise 8.1.


8.2 THE GORY DETAILS (OPTIONAL) 


We do not show the full formal proof of Theorem 8.1 but focus on the 
most important parts: implementing indexed access, and simulating 
two dimensional arrays with one dimensional ones. Even these are 
already quite tedious to describe, as will not be surprising to anyone 
that has ever written a compiler. Hence you can feel free to merely 
skim this section. The important point is not for you to know all de- 
tails by heart but to be convinced that in principle it is possible to 
transform a NAND-RAM program to an equivalent NAND-TM pro- 
gram, and even be convinced that, with sufficient time and effort, you 
could do it if you wanted to. 


8.2.1 Indexed access in NAND-TM 
In NAND-TM we can only access our arrays in the position of the
index variable i, while NAND-RAM has integer-valued variables and
can use them for indexed access to arrays, of the form Foo[bar]. To
implement indexed access in NAND-TM, we will encode integers in our
arrays using some prefix-free representation (see Section 2.5.2), and
then have a procedure Setindex(Bar) that sets i to the value encoded
by Bar. We can simulate the effect of Foo[Bar] using Setindex(Bar)
followed by Foo[i].

Implementing Setindex(Bar) can be achieved as follows:

1. We initialize an array Atzero such that Atzero[0] = 1 and
Atzero[j] = 0 for all j > 0. (This can be easily done in NAND-TM
as all uninitialized variables default to zero.)


2. Set i to zero, by decrementing it until we reach the point where
Atzero[i] = 1.


3. Let Temp be an array encoding the number 0.


4. We use GOTO to simulate an inner loop of the form: while Temp ≠
Bar, increment Temp and move i one position to the right.


5. At the end of the loop, i is equal to the value encoded by Bar.


In NAND-TM code (using some syntactic sugar), we can
implement the above operations as follows:


# assume Atzero is an array such that Atzero[0]=1
# and Atzero[j]=0 for all j>0

# set i to 0.
LABEL("zero_idx")
dir0 = zero
dir1 = one
# corresponds to i <- i-1
GOTO("zero_idx",NOT(Atzero[i]))

# zero out temp
# (code below assumes a specific prefix-free encoding in
# which 10 is the "end marker")
Temp[0] = 1
Temp[1] = 0

# set i to Bar, assume we know how to increment, compare
LABEL("increment_temp")
cond = EQUAL(Temp,Bar)
dir0 = one
dir1 = one
# corresponds to i <- i+1
INC(Temp)
# repeat the loop as long as Temp is not yet equal to Bar
GOTO("increment_temp",NOT(cond))
# if we reach this point, i is number encoded by Bar

# final instruction of program
MODANDJUMP(dir0,dir1)
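To get a feel for the two phases of Setindex, the following Python sketch mimics them with an index that can only move one step at a time. It is only an illustration under simplifying assumptions: the names Head, atzero and set_index are ours, Bar is passed as an ordinary Python integer rather than as a prefix-free encoded array, and comparing and incrementing Temp are done with Python arithmetic instead of NAND-TM code.

```python
class Head:
    """An index variable that, as in NAND-TM, moves only one step at a time."""

    def __init__(self):
        self.i = 0

    def step(self, direction):
        # direction is +1 (right) or -1 (left); the index never drops below 0
        self.i = max(0, self.i + direction)


def atzero(j):
    """The Atzero array: 1 at position 0 and 0 everywhere else."""
    return 1 if j == 0 else 0


def set_index(head, bar):
    """Move head.i to the integer bar using single steps only."""
    # Phase 1: walk left until the Atzero marker is under the index.
    while atzero(head.i) != 1:
        head.step(-1)
    # Phase 2: count Temp up to bar, moving right once per increment.
    temp = 0
    while temp != bar:
        temp += 1
        head.step(+1)
    return head.i


head = Head()
head.i = 7          # pretend the index was left somewhere in the middle
print(set_index(head, 3))
```

The number of single steps is proportional to the old position plus the new one, which is why this simulation is slow but still perfectly valid for the purpose of Theorem 8.1.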


8.2.2 Two dimensional arrays in NAND-TM 
To implement two dimensional arrays, we want to embed them in a
one dimensional array. The idea is that we come up with a one to one
function $\mathrm{embed} : \mathbb{N} \times \mathbb{N} \to \mathbb{N}$,
and so embed the location $(i,j)$ of the two dimensional array Two in
the location $\mathrm{embed}(i,j)$ of the array One.
Since the set $\mathbb{N} \times \mathbb{N}$ seems “much bigger” than
the set $\mathbb{N}$, a priori it might not be clear that such a one to
one mapping exists. However, once you think about it more, it is not
that hard to construct. For example, you could ask a child to use
scissors and glue to transform a 10” by 10” piece of paper into a
1” by 100” strip. This is essentially a one to one map from
$[10] \times [10]$ to $[100]$. We can generalize this to obtain a one
to one map from $[n] \times [n]$ to $[n^2]$ and more generally a one
to one map from $\mathbb{N} \times \mathbb{N}$ to $\mathbb{N}$.
Specifically, the following map embed would do (see Fig. 8.5):

$$\mathrm{embed}(x,y) = \tfrac{1}{2}(x+y)(x+y+1) + x .$$


Exercise 8.3 asks you to prove that embed is indeed one to one, as
well as computable by a NAND-TM program. (The latter can be done
by simply following the grade-school algorithms for multiplication,
addition, and division.) This means that we can replace code of the
form Two[Foo][Bar] = something (i.e., access the two dimensional
array Two at the integers encoded by the one dimensional arrays Foo
and Bar) by code of the form:

Blah = embed(Foo,Bar)
Setindex(Blah)
Two[i] = something

Figure 8.5: The map embed sends distinct pairs $(x,y)$ and $(x',y')$
to distinct values: $\mathrm{embed}(x,y) \neq \mathrm{embed}(x',y')$.
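As a quick sanity check (not a replacement for the proof requested in Exercise 8.3), we can compute embed in Python and test that it is one to one on a small grid. The function name embed matches the text; the grid size 10 is an arbitrary choice for this sketch.

```python
def embed(x: int, y: int) -> int:
    """embed(x, y) = (x + y)(x + y + 1)/2 + x.

    The product (x + y)(x + y + 1) is always even, so integer division is exact.
    """
    return (x + y) * (x + y + 1) // 2 + x


# All pairs in [10] x [10] should land on distinct locations.
grid = [(x, y) for x in range(10) for y in range(10)]
images = {embed(x, y) for (x, y) in grid}
print(len(grid), len(images))

# Pairs on the first four diagonals (x + y <= 3), listed in order of their image.
first = sorted((embed(x, y), (x, y)) for (x, y) in grid if x + y <= 3)
print(first)
```

The second printout shows why the map works: it walks through the pairs one diagonal $x+y = s$ at a time, so no location of the one dimensional array is used twice.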


8.2.3 All the rest 

Once we have two dimensional arrays and indexed access, simulating
NAND-RAM with NAND-TM is just a matter of implementing the
standard algorithms for arithmetic operations and comparisons in
NAND-TM. While this is cumbersome, it is not difficult, and the end
result is to show that every NAND-RAM program P can be simulated
by an equivalent NAND-TM program Q, thus completing the proof of
Theorem 8.1.


8.3 TURING EQUIVALENCE (DISCUSSION)


Standard programming languages such as C, Java, Python, Pascal, and
Fortran all have operations very similar to those of NAND-RAM.
(Indeed, ultimately they can all be executed by machines which have a
fixed number of registers and a large memory array.) Hence using
Theorem 8.1, we can simulate any program in such a programming
language by a NAND-TM program. In the other direction, it is a fairly
easy programming exercise to write an interpreter for NAND-TM in
any of the above programming languages. Hence we can also simulate
NAND-TM programs (and so by Theorem 7.11, Turing machines) using
these programming languages. A language that is equivalent in power
to Turing machines / NAND-TM in this sense is called Turing
equivalent (or sometimes Turing complete). Thus all programming
languages we are familiar with are Turing equivalent.¹


8.3.1 The “Best of both worlds” paradigm 
The equivalence between Turing machines and RAM machines allows 
us to choose the most convenient language for the task at hand: 


• When we want to prove a theorem about all programs/algorithms,
we can use Turing machines (or NAND-TM) since they are simpler
and easier to analyze. In particular, if we want to show that
a certain function cannot be computed, then we will use Turing
machines.

• When we want to show that a function can be computed we can use
RAM machines or NAND-RAM, because they are easier to program in
and correspond more closely to high level programming languages
we are used to. In fact, we will often describe NAND-RAM programs
in an informal manner, trusting that the reader can fill in the
details and translate the high level description to the precise
program. (This is just like the way people typically use informal
or “pseudocode” descriptions of algorithms, trusting that their
audience will know to translate these descriptions to code if
needed.)

Figure 8.6: A punched card corresponding to a Fortran statement.

¹ Some programming languages have fixed (even if extremely large)
bounds on the amount of memory they can access, which formally
prevent them from being applicable to computing infinite functions
and hence simulating Turing machines. We ignore such issues in this
discussion and assume access to some storage device without a fixed
upper bound on its capacity.


Our usage of Turing machines / NAND-TM and RAM machines /
NAND-RAM is very similar to the way people use high and low level
programming languages in practice. When one wants to produce
a device that executes programs, it is convenient to do so for a very
simple and “low level” programming language. When one wants to
describe an algorithm, it is convenient to use as high level a
formalism as possible.


8.3.2 Let’s talk about abstractions 
“The programmer is in the unique position that ... he has to be able 
to think in terms of conceptual hierarchies that are much deeper than 
a single mind ever needed to face before.”, Edsger Dijkstra, “On the 
cruelty of really teaching computing science”, 1988. 


At some point in any theory of computation course, the instructor 
and students need to have the talk. That is, we need to discuss the level 
of abstraction in describing algorithms. In algorithms courses, one 
typically describes algorithms in English, assuming readers can “fill 
in the details” and would be able to convert such an algorithm into an 
implementation if needed. For example, Algorithm 8.4 is a high level 
description of the breadth first search algorithm. 

Figure 8.7: By having the two equivalent languages NAND-TM and
NAND-RAM, we can “have our cake and eat it too”, using NAND-TM
when we want to prove that programs can’t do something, and using
NAND-RAM or other high level languages when we want to prove that
programs can do something.

If we wanted to give more details on how to implement breadth
first search in a programming language such as Python or C (or
NAND-RAM / NAND-TM for that matter), we would describe how
we implement the queue data structure using an array, and similarly
how we would use arrays to mark vertices. We call such an
“intermediate level” description an implementation level or
pseudocode description. Finally, if we want to describe the
implementation precisely, we would give the full code of the program
(or another fully precise representation, such as in the form of a
list of tuples). We call this a formal or low level description.
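To make the implementation level concrete, here is one possible Python rendering of breadth first search in which the queue is a plain array with two cursors and the visited marks are a second array. This is a sketch under our own assumptions (an undirected graph on vertices 0..n-1 given as a list of edges, and a connectivity question between two vertices s and t); the function name bfs_connected is ours and is not taken from Algorithm 8.4.

```python
def bfs_connected(n, edges, s, t):
    """Return True if t is reachable from s in an undirected n-vertex graph."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    visited = [False] * n   # array of marks, one per vertex
    queue = [0] * n         # array-based queue: each vertex enters at most once
    head = tail = 0         # queue contents are queue[head:tail]

    queue[tail] = s
    tail += 1
    visited[s] = True

    while head < tail:
        u = queue[head]
        head += 1
        if u == t:
            return True
        for v in adj[u]:
            if not visited[v]:
                visited[v] = True
                queue[tail] = v
                tail += 1
    return False


print(bfs_connected(4, [(0, 1), (1, 2)], 0, 2))
print(bfs_connected(4, [(0, 1), (1, 2)], 0, 3))
```

Every operation here is an array read, an array write, or integer arithmetic, which is exactly the kind of step NAND-RAM supports directly.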


High level description: “Output smallest element in list L”

Pseudocode / implementation level description: List L is n numbers
each encoded as string in $\{0,1\}^\ell$
1. Let smallest = $2^{\ell+1}$
2. For i = 0..n−1: smallest = min(L[i], smallest)
3. Output smallest

Low level fully formal description:
Temp[0] = NAND(X[0],X[0])
Temp[1] = NAND(X[0],Temp[0])
Temp[2] = NAND(X[0],Temp[1])
Temp[3] = NAND(X[0],Temp[2])
Temp[4] = NAND(Temp[1],Temp[2])

Figure 8.8: We can describe an algorithm at different levels of
granularity/detail and precision. At the highest level we just write
the idea in words, omitting all details on representation and
implementation. In the intermediate level (also known as
implementation or pseudocode) we give enough details of the
implementation that would allow someone to derive it, though we
still fall short of providing the full code. The lowest level is
where the actual code or mathematical description is fully spelled
out. These different levels of detail all have their uses, and moving
between them is one of the most important skills for a computer
scientist.

While we started off by describing NAND-CIRC, NAND-TM, and
NAND-RAM programs at the full formal level, as we progress in this
book we will move to implementation and high level description.
After all, our goal is not to use these models for actual computation,
but rather to analyze the general phenomenon of computation. That
said, if you don’t understand how the high level description translates
to an actual implementation, going “down to the metal” is often an
excellent exercise. One of the most important skills for a computer
scientist is the ability to move up and down hierarchies of abstractions.

A similar distinction applies to the notion of representation of objects
as strings. Sometimes, to be precise, we give a low level specification
of exactly how an object maps into a binary string. For example, we
might describe an encoding of $n$ vertex graphs as length $n^2$ binary
strings, by saying that we map a graph $G$ over the vertices $[n]$ to a
string $x \in \{0,1\}^{n^2}$ such that the $(n \cdot i + j)$-th
coordinate of $x$ is 1 if and only if the edge $\overrightarrow{i\,j}$
is present in $G$. We can also use an intermediate or implementation
level description, by simply saying that we represent a graph using
the adjacency matrix representation.
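The low level specification above is short enough to write out in full. The following Python sketch does so; the names encode_graph and has_edge are ours, and edges are given as ordered pairs (i, j).

```python
def encode_graph(n, edges):
    """Encode an n-vertex directed graph as a string of length n*n.

    Coordinate n*i + j is '1' exactly when the edge from i to j is present.
    """
    bits = ["0"] * (n * n)
    for i, j in edges:
        bits[n * i + j] = "1"
    return "".join(bits)


def has_edge(x, n, i, j):
    """Read the edge from i to j back out of the encoding x."""
    return x[n * i + j] == "1"


x = encode_graph(3, [(0, 1), (2, 0)])
print(x)
print(has_edge(x, 3, 2, 0), has_edge(x, 3, 0, 2))
```

Reading the string in rows of length n gives back the adjacency matrix, which is why the one-line implementation level description suffices for most readers.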

Finally, because translating between the various representations
of graphs (and objects in general) can be done via a NAND-RAM
(and hence a NAND-TM) program, when talking at a high level
we also suppress discussion of representation altogether. For example,
the fact that graph connectivity is a computable function is true
regardless of whether we represent graphs as adjacency lists, adjacency
matrices, lists of edge-pairs, and so on and so forth. Hence, in cases
where the precise representation doesn’t make a difference, we would
often talk about our algorithms as taking as input an object $X$ (that
can be a graph, a vector, a program, etc.) without specifying how $X$
is encoded as a string.


Defining “Algorithms”. Up until now we have used the term “algo- 
rithm” informally. However, Turing machines and the range of equiv- 
alent models yield a way to precisely and formally define algorithms. 
Hence whenever we refer to an algorithm in this book, we will mean 
that it is an instance of one of the Turing equivalent models, such as 
Turing machines, NAND-TM, RAM machines, etc. Because of the 
equivalence of all these models, in many contexts, it will not matter 
which of these we use. 


8.3.3 Turing completeness and equivalence, a formal definition (optional) 
A computational model is some way to define what it means for a
program (which is represented by a string) to compute a (partial)
function. A computational model $\mathcal{M}$ is Turing complete if we
can map every Turing machine (or equivalently NAND-TM program) $N$
into a program $P$ for $\mathcal{M}$ that computes the same function
as $N$. It is Turing equivalent if the other direction holds as well
(i.e., we can map every program in $\mathcal{M}$ to a Turing machine
that computes the same function). We can define this notion formally
as follows. (This formal definition is not crucial for the remainder
of this book, so feel free to skip it as long as you understand the
general concept of Turing equivalence. This notion is sometimes
referred to in the literature as Gödel numbering or admissible
numbering.)


Definition 8.5 — Turing completeness and equivalence (optional). Let
$\mathcal{F}$ be the set of all partial functions from $\{0,1\}^*$ to
$\{0,1\}^*$. A computational model is a map
$\mathcal{M} : \{0,1\}^* \to \mathcal{F}$.

We say that a program $P \in \{0,1\}^*$ $\mathcal{M}$-computes a
function $F \in \mathcal{F}$ if $\mathcal{M}(P) = F$.

A computational model $\mathcal{M}$ is Turing complete if there is a
computable map $ENCODE_{\mathcal{M}} : \{0,1\}^* \to \{0,1\}^*$ such
that for every Turing machine $N$ (represented as a string),
$\mathcal{M}(ENCODE_{\mathcal{M}}(N))$ is equal to the partial
function computed by $N$.

A computational model $\mathcal{M}$ is Turing equivalent if it is
Turing complete and there exists a computable map
$DECODE_{\mathcal{M}} : \{0,1\}^* \to \{0,1\}^*$ such that for every
string $P \in \{0,1\}^*$, $N = DECODE_{\mathcal{M}}(P)$ is a string
representation of a Turing machine that computes the function
$\mathcal{M}(P)$.


Some examples of Turing equivalent models (some of which we
have already seen, and some are discussed below) include:

• Turing machines

• NAND-TM programs

• NAND-RAM programs

• λ calculus

• Game of life (mapping programs and inputs/outputs to starting
and ending configurations)

• Programming languages such as Python/C/Javascript/OCaml...
(allowing for unbounded storage)


8.4 CELLULAR AUTOMATA 


Many physical systems can be described as consisting of a large num- 
ber of elementary components that interact with one another. One 
way to model such systems is using cellular automata. This is a system 
that consists of a large (or even infinite) number of cells. Each cell 
only has a constant number of possible states. At each time step, a cell 
updates to a new state by applying some simple rule to the state of 
itself and its neighbors. 

A canonical example of a cellular automaton is Conway’s Game
of Life. In this automaton the cells are arranged in an infinite two
dimensional grid. Each cell has only two states: “dead” (which we can
encode as 0 and identify with $\varnothing$) or “alive” (which we can
encode as 1). The next state of a cell depends on its previous state
and the states of its 8 vertical, horizontal and diagonal neighbors
(see Fig. 8.9). A dead cell becomes alive only if exactly three of its
neighbors are alive. A live cell continues to live if it has two or
three live neighbors. Even though the number of cells is potentially
infinite, we can encode the state using a finite-length string by only
keeping track of the live cells. If we initialize the system in a
configuration with a finite number of live cells, then the number of
live cells will stay finite in all future steps. The Wikipedia page
for the Game of Life contains some beautiful figures and animations of
configurations that produce very interesting evolutions.
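The idea of keeping track of only the live cells translates directly into code. The sketch below stores a configuration as a Python set of (x, y) coordinates and computes one step of the rules stated above; the function name life_step and the example pattern (the well known "blinker") are our own choices.

```python
from collections import Counter


def life_step(live):
    """One step of the Game of Life on a finite set of live cells."""
    # For every cell adjacent to some live cell, count its live neighbors.
    counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth with exactly three live neighbors; survival with two or three.
    return {cell for cell, c in counts.items()
            if c == 3 or (c == 2 and cell in live)}


blinker = {(0, 1), (1, 1), (2, 1)}
after_one = life_step(blinker)
after_two = life_step(after_one)
print(sorted(after_one))
print(after_two == blinker)
```

Only cells that touch a live cell can ever change, so each step takes time proportional to the number of live cells, even though the grid itself is infinite.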


Figure 8.9: Rules for Conway’s Game of Life. Image from this blog
post.

Since the cells in the game of life are arranged in an infinite
two-dimensional grid, it is an example of a two dimensional cellular
automaton. We can also consider the even simpler setting of a one
dimensional cellular automaton, where the cells are arranged in an
infinite line, see Fig. 8.10. It turns out that even this simple model
is enough to achieve Turing-completeness. We will now formally define
one-dimensional cellular automata and then prove their Turing
completeness.

Figure 8.10: In a two dimensional cellular automaton every cell is in
position $i,j$ for some integers $i,j \in \mathbb{Z}$. The state of a
cell is some value $A_{i,j} \in \Sigma$ for some finite alphabet
$\Sigma$. At a given time step, the state of the cell is adjusted
according to some function applied to the state of $(i,j)$ and all its
neighbors $(i \pm 1, j \pm 1)$. In a one dimensional cellular automaton
every cell is in position $i \in \mathbb{Z}$ and the state $A_i$ of $i$
at the next time step depends on its current state and the state of its
two neighbors $i-1$ and $i+1$.


Definition 8.6 — One dimensional cellular automata. Let $\Sigma$ be a
finite set containing the symbol $\varnothing$. A one dimensional
cellular automaton over alphabet $\Sigma$ is described by a transition
rule $r : \Sigma^3 \to \Sigma$, which satisfies
$r(\varnothing,\varnothing,\varnothing) = \varnothing$.

A configuration of the automaton $r$ is a function
$A : \mathbb{Z} \to \Sigma$. If an automaton with rule $r$ is in
configuration $A$, then its next configuration, denoted by
$A' = NEXT_r(A)$, is the function $A'$ such that
$A'(i) = r(A(i-1), A(i), A(i+1))$ for every $i \in \mathbb{Z}$. In
other words, the next state of the automaton $r$ at point $i$ is
obtained by applying the rule $r$ to the values of $A$ at $i$ and its
two neighbors.


Finite configuration. We say that a configuration $A$ of an automaton
$r$ is finite if there is only some finite number of indices
$i_0,\ldots,i_{n-1}$ in $\mathbb{Z}$ such that
$A(i_k) \neq \varnothing$ for every $k \in [n]$. (That is, for every
$i \notin \{i_0,\ldots,i_{n-1}\}$, $A(i) = \varnothing$.)
Such a configuration can be represented using a finite string that
encodes the indices $i_0,\ldots,i_{n-1}$ and the values
$A(i_0),\ldots,A(i_{n-1})$. Since
$r(\varnothing,\varnothing,\varnothing) = \varnothing$, if $A$ is a
finite configuration then $NEXT_r(A)$ is finite as well. We will only
be interested in studying cellular automata that are initialized in
finite configurations, and hence remain in a finite configuration
throughout their evolution.
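The finite representation suggests a direct way to compute $NEXT_r$. In the following Python sketch a finite configuration is a dictionary mapping the non-blank positions to their values, and the blank symbol is represented by 0. The names next_config and xor_rule are ours, and the particular rule (each cell becomes the XOR of its two neighbors) is just an example of a rule that maps three blanks to a blank.

```python
BLANK = 0  # plays the role of the blank symbol


def next_config(r, config):
    """Apply rule r once to a finite configuration {position: non-blank value}."""
    # Only cells within distance one of a non-blank cell can become non-blank,
    # because r(BLANK, BLANK, BLANK) = BLANK.
    candidates = {i + d for i in config for d in (-1, 0, 1)}
    nxt = {}
    for i in candidates:
        new = r(config.get(i - 1, BLANK),
                config.get(i, BLANK),
                config.get(i + 1, BLANK))
        if new != BLANK:
            nxt[i] = new
    return nxt


def xor_rule(left, mid, right):
    """Example rule over the alphabet {0, 1}: XOR of the two neighbors."""
    return left ^ right


step1 = next_config(xor_rule, {0: 1})
step2 = next_config(xor_rule, step1)
print(step1, step2)
```

Since at most two new positions can become non-blank in each step, a configuration with $n$ non-blank cells has at most $n+2$ of them after one step.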


8.4.1 One dimensional cellular automata are Turing complete 
We can write a program (for example using NAND-RAM) that simulates
the evolution of any cellular automaton from an initial finite
configuration by simply storing the values of the cells with state not
equal to $\varnothing$ and repeatedly applying the rule $r$. Hence
cellular automata can be simulated by Turing machines. What is more
surprising is that the other direction holds as well. For example, as
simple as its rules seem, we can simulate a Turing machine using the
game of life (see Fig. 8.11).

In fact, even one dimensional cellular automata can be Turing
complete:


Theorem 8.7 — One dimensional automata are Turing complete. For every 
Turing machine M, there is a one dimensional cellular automaton 
that can simulate M on every input x.


To make the notion of “simulating a Turing machine” more precise
we will need to define configurations of Turing machines. We will
do so in Section 8.4.2 below, but at a high level a configuration of a
Turing machine is a string that encodes its full state at a given step
in its computation. That is, the contents of all (non-empty) cells of
its tape, its current state, as well as the head position.

Figure 8.11: A Game-of-Life configuration simulating a Turing machine.
Figure by Paul Rendell.

The key idea in the proof of Theorem 8.7 is that at every point in 
the computation of a Turing machine M, the only cell in M’s tape that 
can change is the one where the head is located, and the value this 
cell changes to is a function of its current state and the finite state of 
M. This observation allows us to encode the configuration of a Turing 
machine M as a finite configuration of a cellular automaton r, and 
ensure that a one-step evolution of this encoded configuration under 
the rules of r corresponds to one step in the execution of the Turing 
machine M. 


8.4.2 Configurations of Turing machines and the next-step function 

To turn the above ideas into a rigorous proof (and even statement!) 
of Theorem 8.7 we will need to precisely define the notion of config- 
urations of Turing machines. This notion will be useful for us in later 
chapters as well. 


Figure 8.12: A configuration of a Turing machine $M$ with alphabet
$\Sigma$ and state space $[k]$ encodes the state of $M$ at a particular
step in its execution as a string $\alpha$ over the alphabet
$\overline{\Sigma} = \Sigma \times (\{\cdot\} \cup [k])$. The string is
of length $t$ where $t$ is such that $M$’s tape contains $\varnothing$
in all positions $t$ and larger and $M$’s head is in a position smaller
than $t$. If $M$’s head is in the $i$-th position, then for
$j \neq i$, $\alpha_j$ encodes the value of the $j$-th cell of $M$’s
tape, while $\alpha_i$ encodes both this value as well as the current
state of $M$. If the machine writes the value $\tau$, changes state to
$t$, and moves right, then the next configuration will contain at
position $i$ the value $(\tau,\cdot)$ and at position $i+1$ the value
$(\alpha_{i+1}, t)$. (Panels in the figure: the head position
$i \in \mathbb{N}$ and the local state; the string over alphabet
$\Sigma \times (\{\cdot\} \cup [k])$ encoding the configuration; the
configuration in the next step.)


Definition 8.8 — Configuration of Turing machines. Let $M$ be a Turing
machine with tape alphabet $\Sigma$ and state space $[k]$. A
configuration of $M$ is a string $\alpha \in \overline{\Sigma}^*$ where
$\overline{\Sigma} = \Sigma \times (\{\cdot\} \cup [k])$ that satisfies
that there is exactly one coordinate $i$ for which
$\alpha_i = (\sigma, s)$ for some $\sigma \in \Sigma$ and $s \in [k]$.
For all other coordinates $j$, $\alpha_j = (\sigma', \cdot)$ for some
$\sigma' \in \Sigma$.

A configuration $\alpha \in \overline{\Sigma}^*$ of $M$ corresponds to
the following state of its execution:

• $M$’s tape contains $\alpha_{j,0}$ for all $j < |\alpha|$ and
contains $\varnothing$ for all positions that are at least $|\alpha|$,
where we let $\alpha_{j,0}$ be the value $\sigma$ such that
$\alpha_j = (\sigma, t)$ with $\sigma \in \Sigma$ and
$t \in \{\cdot\} \cup [k]$. (In other words, since $\alpha_j$ is a pair
of an alphabet symbol $\sigma$ and either a state in $[k]$ or the
symbol $\cdot$, $\alpha_{j,0}$ is the first component $\sigma$ of this
pair.)

• $M$’s head is in the unique position $i$ for which $\alpha_i$ has the
form $(\sigma, s)$ for $s \in [k]$, and $M$’s state is equal to $s$.


Definition 8.8 is a little cumbersome, but ultimately a configuration 
is simply a string that encodes a snapshot of the Turing machine at a 
given point in the execution. (In operating-systems lingo, it is a “core 
dump”.) Such a snapshot needs to encode the following components: 


1. The current head position. 
2. The full contents of the large scale memory, that is the tape. 


3. The contents of the “local registers”, that is the state of the ma- 
chine. 


The precise details of how we encode a configuration are not impor- 
tant, but we do want to record the following simple fact: 


Lemma 8.9 Let $M$ be a Turing machine and let
$NEXT_M : \overline{\Sigma}^* \to \overline{\Sigma}^*$
be the function that maps a configuration of $M$ to the configuration
at the next step of the execution. Then for every $i \in \mathbb{N}$,
the value of $NEXT_M(\alpha)_i$ only depends on the coordinates
$\alpha_{i-1}, \alpha_i, \alpha_{i+1}$.


(For simplicity of notation, above we use the convention that if $i$
is “out of bounds”, such as $i < 0$ or $i \geq |\alpha|$, then we
assume that $\alpha_i = (\varnothing, \cdot)$.) We leave proving
Lemma 8.9 as Exercise 8.7. The idea behind the proof is simple: if the
head is neither in position $i$ nor positions $i-1$ and $i+1$, then the
next-step configuration at $i$ will be the same as it was before.
Otherwise, we can “read off” the state of the Turing machine and the
value of the tape at the head location from the configuration at $i$
or one of its neighbors and use that to update what the new state at
$i$ should be. Completing the full proof is not hard, but doing it is
a great way to ensure that you are comfortable with the definition of
configurations.


Completing the proof of Theorem 8.7. We can now restate Theorem 8.7
more formally, and complete its proof:

Theorem 8.10 — One dimensional automata are Turing complete (formal
statement). For every Turing machine $M$, if we denote by
$\overline{\Sigma}$ the alphabet of its configuration strings, then
there is a one-dimensional cellular automaton $r$ over the alphabet
$\overline{\Sigma}$ such that

$$NEXT_M(\alpha) = NEXT_r(\alpha)$$

for every configuration $\alpha \in \overline{\Sigma}^*$ of $M$ (again
using the convention that we consider $\alpha_i = \varnothing$ if $i$
is “out of bounds”).

Proof. We consider the element $(\varnothing, \cdot)$ of
$\overline{\Sigma}$ to correspond to the $\varnothing$ element of the
automaton $r$. In this case, by Lemma 8.9, the function $NEXT_M$ that
maps a configuration of $M$ into the next one is in fact a valid rule
for a one dimensional automaton.
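The proof can be checked experimentally on a small example. The Python sketch below represents a configuration as a list of pairs (symbol, state), with None standing for the symbol $\cdot$, builds the local rule from a transition table, and compares one automaton step with one direct machine step. Everything specific here is an assumption made for illustration: the machine DELTA (a single-state machine that flips bits while moving right and halts on a blank) is hypothetical, a missing table entry is treated as "halted, configuration unchanged", and the sketch ignores the boundary case of a head trying to move left from position 0.

```python
BLANK = "_"             # stands in for the blank symbol
EMPTY = (BLANK, None)   # a blank cell without the head

# Hypothetical machine: (state, symbol) -> (new state, written symbol, move)
DELTA = {(0, "0"): (0, "1", "R"), (0, "1"): (0, "0", "R")}


def action(cell):
    """Transition triggered by this cell, or None if no head here or halted."""
    sym, st = cell
    return None if st is None else DELTA.get((st, sym))


def rule(left, mid, right):
    """Local rule r: the new value of a cell from itself and its neighbors."""
    act = action(mid)
    if act is not None:                      # head is here: write, maybe leave
        q, w, mv = act
        return (w, q) if mv == "S" else (w, None)
    act = action(left)
    if act is not None and act[2] == "R":    # head arrives from the left
        return (mid[0], act[0])
    act = action(right)
    if act is not None and act[2] == "L":    # head arrives from the right
        return (mid[0], act[0])
    return mid


def strip(alpha):
    """Drop trailing blank cells so that equal configurations compare equal."""
    alpha = list(alpha)
    while alpha and alpha[-1] == EMPTY:
        alpha.pop()
    return alpha


def next_r(alpha):
    """One automaton step: apply the local rule at every position."""
    def get(i):
        return alpha[i] if 0 <= i < len(alpha) else EMPTY
    return strip(rule(get(i - 1), get(i), get(i + 1))
                 for i in range(len(alpha) + 1))


def next_m(alpha):
    """One direct step of the machine on the same configuration."""
    i = next(j for j, (_, st) in enumerate(alpha) if st is not None)
    act = action(alpha[i])
    if act is None:
        return strip(alpha)
    q, w, mv = act
    new = list(alpha) + [EMPTY]
    new[i] = (w, None)
    j = i + {"L": -1, "S": 0, "R": 1}[mv]
    assert j >= 0, "sketch assumes the head never moves left of cell 0"
    new[j] = (new[j][0], q)
    return strip(new)


alpha = [("0", 0), ("1", None), ("1", None)]
for _ in range(5):
    assert next_r(alpha) == next_m(alpha)
    alpha = next_m(alpha)
print(alpha)
```

Note that rule never looks beyond the three cells it is given, which is exactly the content of Lemma 8.9.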


The automaton arising from the proof of Theorem 8.10 has a large
alphabet, and furthermore one whose size depends on the machine $M$
that is being simulated. It turns out that one can obtain an
automaton with an alphabet of fixed size that is independent of the
program being simulated, and in fact the alphabet of the automaton
can be the minimal set $\{0,1\}$! See Fig. 8.13 for an example of such
a Turing-complete automaton.


Figure 8.13: Evolution of a one dimensional automaton. Each row in
the figure corresponds to a configuration. The initial configuration
corresponds to the top row and contains only a single “live” cell.
This figure corresponds to the “Rule 110” automaton of Stephen
Wolfram which is Turing complete. Figure taken from Wolfram
MathWorld.
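The first few rows of such a picture are easy to reproduce. The sketch below uses the standard numbering convention for two-state rules, in which the output for the neighborhood (left, mid, right) is bit number 4·left + 2·mid + right of the rule number 110. The function names and the finite window of width 5 are our own choices; cells outside the window are treated as 0, which is harmless for these first rows because the pattern only grows to the left.

```python
def rule110(left, mid, right):
    """Output of Rule 110: bit (4*left + 2*mid + right) of the number 110."""
    return (110 >> (4 * left + 2 * mid + right)) & 1


def evolve(width, steps):
    """Rows of Rule 110 inside a finite window, starting from one live cell."""
    row = [0] * width
    row[-1] = 1                      # single live cell at the right end
    rows = ["".join(map(str, row))]
    for _ in range(steps):
        padded = [0] + row + [0]     # cells outside the window count as 0
        row = [rule110(padded[i], padded[i + 1], padded[i + 2])
               for i in range(width)]
        rows.append("".join(map(str, row)))
    return rows


rows = evolve(5, 3)
for r in rows:
    print(r)
```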
8.5 LAMBDA CALCULUS AND FUNCTIONAL PROGRAMMING LANGUAGES


The λ calculus is another way to define computable functions. It was
proposed by Alonzo Church in the 1930’s around the same time as
Alan Turing’s proposal of the Turing machine. Interestingly, while
Turing machines are not used for practical computation, the λ calculus
has inspired functional programming languages such as LISP, ML and
Haskell, and indirectly the development of many other programming
languages as well. In this section we will present the λ calculus and
show that its power is equivalent to NAND-TM programs (and hence
also to Turing machines). Our Github repository contains a Jupyter
notebook with a Python implementation of the λ calculus that you can
experiment with to get a better feel for this topic.


The λ operator. At the core of the λ calculus is a way to define
“anonymous” functions. For example, instead of giving a name $f$ to a
function and defining it as

$$f(x) = x \times x$$

we can write it as

$$\lambda x. x \times x$$

and so $(\lambda x. x \times x)(7) = 49$. That is, you can think of
$\lambda x. exp(x)$, where $exp$ is some expression, as a way of
specifying the anonymous function $x \mapsto exp(x)$. Anonymous
functions, using either $\lambda x. f(x)$, $x \mapsto f(x)$ or other
closely related notation, appear in many programming languages. For
example, in Python we can define the squaring function using
lambda x: x*x while in JavaScript we can use x => x*x or
(x) => x*x. In Scheme we would define it as (lambda (x) (* x x)).
Clearly, the name of the argument to a function doesn’t matter, and so
$\lambda y. y \times y$ is the same as $\lambda x. x \times x$, as both
correspond to the squaring function.
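The Python form mentioned above can be tried directly. In this small sketch the variable names square_x and square_y are ours; the two definitions differ only in the name of the parameter.

```python
# The anonymous squaring function, applied immediately to 7.
result = (lambda x: x * x)(7)

# Renaming the parameter does not change the function.
square_x = lambda x: x * x
square_y = lambda y: y * y
agree = all(square_x(n) == square_y(n) for n in range(100))

print(result, agree)
```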

Dropping parentheses. To reduce notational clutter, when writing
λ calculus expressions we often drop the parentheses for function
evaluation. Hence instead of writing $f(x)$ for the result of applying
the function $f$ to the input $x$, we can also write this as simply
$f\ x$. Therefore we can write $(\lambda x. x \times x)\ 7 = 49$. In
this chapter, we will use both the $f(x)$ and $f\ x$ notations for
function application. Function application associates from left to
right, and hence $f\ g\ h$ is the same as $(f g) h$.


8.5.1 Applying functions to functions 
A key feature of the λ calculus is that functions are “first-class
objects” in the sense that we can use functions as arguments to other
functions.

For example, can you guess what number the following expression is
equal to?

$$(((\lambda f.(\lambda y.(f\ (f\ y))))\ (\lambda x. x \times x))\ 3) \qquad (8.1)$$

Let’s evaluate (8.1) one step at a time. As nice as it is for the λ
calculus to allow anonymous functions, adding names can be very
helpful for understanding complicated expressions. So, let us write
$F = \lambda f.(\lambda y.(f (f y)))$ and $g = \lambda x. x \times x$.
Therefore (8.1) becomes

$$((F\ g)\ 3).$$

On input a function $f$, $F$ outputs the function
$\lambda y.(f(f\ y))$, or in other words $F f$ is the function
$y \mapsto f(f(y))$. Our function $g$ is simply $g(x) = x^2$ and so
$(F g)$ is the function that maps $y$ to $(y^2)^2 = y^4$. Hence
$((F g) 3) = 3^4 = 81$.
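Since Python also treats functions as first-class objects, the evaluation of (8.1) can be replayed almost symbol for symbol. The names F and g follow the text; the variable value is ours.

```python
F = lambda f: (lambda y: f(f(y)))   # F takes a function and applies it twice
g = lambda x: x * x                 # the squaring function

value = (F(g))(3)                   # ((F g) 3)
print(value)
```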


Solved Exercise 8.1 What number does the following expression evalu- 
ate to? 


$$((\lambda x.(\lambda y.x))\ 2)\ 9. \qquad (8.2)$$

Solution:

$\lambda y.x$ is the function that on input $y$ ignores its input and
outputs $x$. Hence $(\lambda x.(\lambda y.x))\ 2$ yields the function
$y \mapsto 2$ (or, using λ notation, the function $\lambda y.2$).
Hence (8.2) is equivalent to $(\lambda y.2)\ 9 = 2$.
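The same two steps can be followed in Python. The intermediate name const_two is ours and corresponds to the function $\lambda y.2$ in the solution.

```python
const_two = (lambda x: (lambda y: x))(2)   # the function y -> 2
answer = const_two(9)                      # ignores its input 9
print(answer)
```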


8.5.2 Obtaining multi-argument functions via Currying 
In a λ expression of the form $\lambda x.e$, the expression $e$ can
itself involve the λ operator. Thus for example the function

$$\lambda x.(\lambda y. x + y) \qquad (8.3)$$

maps $x$ to the function $y \mapsto x + y$.

In particular, if we invoke the function (8.3) on $a$ to obtain some
function $f$, and then invoke $f$ on $b$, we obtain the value $a + b$.
We can see that the one-argument function (8.3) corresponding to
$a \mapsto (b \mapsto a + b)$ can also be thought of as the
two-argument function $(a,b) \mapsto a + b$. Generally, we can use the
λ expression $\lambda x.(\lambda y. f(x,y))$ to simulate the effect of
a two argument function $(x,y) \mapsto f(x,y)$. This technique is
known as Currying. We will use the shorthand $\lambda x,y.e$ for
$\lambda x.(\lambda y.e)$. If $f = \lambda x.(\lambda y.e)$ then
$(f a) b$ corresponds to applying $f$ to $a$ and then invoking the
resulting function on $b$, obtaining the result of replacing in $e$
the occurrences of $x$ with $a$ and occurrences of $y$ with $b$. By
our rules of associativity, this is the same as $(f\ a\ b)$ which
we’ll sometimes also write as $f(a,b)$.
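Currying is easy to express in any language with anonymous functions. In the Python sketch below, curry and uncurry are helper names of our own choosing that convert between the two-argument and the one-argument-at-a-time forms.

```python
def curry(f):
    """Turn a two-argument function into x -> (y -> f(x, y))."""
    return lambda x: (lambda y: f(x, y))


def uncurry(g):
    """Turn a curried function back into a two-argument function."""
    return lambda x, y: g(x)(y)


add = lambda x: (lambda y: x + y)     # the function (8.3)
add_three = add(3)                    # 3 is now "hardwired" into the result

print(add_three(4), add(3)(4), uncurry(add)(3, 4))
print(curry(lambda a, b: a * b)(6)(7))
```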


8.5.3 Formal description of the A calculus 

We now provide a formal description of the λ calculus. We start with
“basic expressions” that contain a single variable such as $x$ or $y$
and build more complex expressions of the form $(e\ e')$ and
$\lambda x.e$ where $e, e'$ are expressions and $x$ is a variable
identifier. Formally λ expressions are defined as follows:

Definition 8.12 — λ expression. A λ expression is either a single
variable identifier or an expression $e$ of one of the following forms:

• Application: $e = (e'\ e'')$, where $e'$ and $e''$ are λ expressions.

• Abstraction: $e = \lambda x.(e')$ where $e'$ is a λ expression.

Definition 8.12 is a recursive definition since we defined the concept
of λ expressions in terms of itself. This might seem confusing at
first, but in fact you have known recursive definitions since you were
an elementary school student. Consider how we define an arithmetic
expression: it is an expression that is either just a number, or has
one of the forms $(e + e')$, $(e - e')$, $(e \times e')$, or
$(e \div e')$, where $e$ and $e'$ are other arithmetic expressions.
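A recursive definition like Definition 8.12 corresponds naturally to a recursive data type. The following Python sketch is one way to write it down; the class names Var, App and Abs and the printing function show are our own and are not taken from the notebook mentioned earlier.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Var:
    """A single variable identifier."""
    name: str


@dataclass(frozen=True)
class App:
    """Application (e' e'')."""
    func: "Expr"
    arg: "Expr"


@dataclass(frozen=True)
class Abs:
    """Abstraction: lambda param.(body)."""
    param: str
    body: "Expr"


Expr = Union[Var, App, Abs]


def show(e):
    """Print an expression; the recursion mirrors the definition."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, App):
        return f"({show(e.func)} {show(e.arg)})"
    return f"λ{e.param}.{show(e.body)}"


twice = Abs("f", Abs("y", App(Var("f"), App(Var("f"), Var("y")))))
print(show(twice))
```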

Free and bound variables. Variables in a λ expression can either be
free or bound to a λ operator (in the sense of Section 1.4.7). In a
single-variable λ expression $var$, the variable $var$ is free. The set
of free and bound variables in an application expression
$e = (e'\ e'')$ is the same as that of the underlying expressions $e'$
and $e''$. In an abstraction expression $e = \lambda var.(e')$, all
free occurrences of $var$ in $e'$ are bound to the λ operator of $e$.
If you find the notion of free and bound variables confusing, you can
avoid all these issues by using unique identifiers for all variables.
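The three cases in this paragraph translate into a three-case recursive function. The sketch below repeats a minimal representation of λ expressions (the class names Var, App and Abs are our own) so that it can be run on its own, and computes the set of free variables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class App:
    func: object
    arg: object


@dataclass(frozen=True)
class Abs:
    param: str
    body: object


def free_vars(e):
    """Set of variable names that occur free in the expression e."""
    if isinstance(e, Var):
        return {e.name}                          # a lone variable is free
    if isinstance(e, App):
        return free_vars(e.func) | free_vars(e.arg)
    return free_vars(e.body) - {e.param}         # the λ binds its parameter


# In λx.(x y) the variable x is bound and y is free.
e = Abs("x", App(Var("x"), Var("y")))
print(free_vars(e))
```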

Figure 8.14: In the “currying” transformation, we can
create the effect of a two parameter function f(x, y)
with the λ expression λx.(λy.f(x, y)) which on input
x outputs a one-parameter function f_x that has x
“hardwired” into it and such that f_x(y) = f(x, y).
This can be illustrated by a circuit diagram; see
Chelsea Voss’s site.

Precedence and parentheses. We will use the following rules to allow
us to drop some parentheses. Function application associates from left
to right, and so fgh is the same as (fg)h. Function application has a
higher precedence than the λ operator, and so λx.fgx is the same as
λx.((fg)x). This is similar to how we use the precedence rules in
arithmetic operations to allow us to use fewer parentheses and so write the
expression (7 × 3) + 2 as 7 × 3 + 2. As mentioned in Section 8.5.2, we
also use the shorthand λx,y.e for λx.(λy.e) and the shorthand f(x, y)
for (f x) y. This plays nicely with the “Currying” transformation of
simulating multi-input functions using λ expressions.
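As a rough illustration of currying in Python (the function f and the names curried_f and f_3 are made up for this sketch), a two-argument function can be replaced by a chain of one-argument functions:

```python
# Hypothetical two-argument function, used only for illustration.
def f(x, y):
    return 10 * x + y

# Curried form, mirroring the lambda expression λx.(λy.f(x, y)).
curried_f = lambda x: (lambda y: f(x, y))

f_3 = curried_f(3)        # x = 3 is now "hardwired" into f_3
print(f_3(4))             # same value as f(3, 4)
print(curried_f(3)(4))    # application associates to the left: (curried_f 3) 4
```

The last line reads like the shorthand f(x, y) for (f x) y: the first application returns a function, which is then applied to the second input.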


Equivalence of λ expressions. As we have seen in Solved Exercise 8.1,
the rule that (λx.exp)exp′ is equivalent to exp[x → exp′] enables us
to modify λ expressions and obtain simpler equivalent forms for them.
Another rule that we can use is that the parameter does not matter
and hence for example λy.y is the same as λz.z. Together these rules
define the notion of equivalence of λ expressions:


Definition 8.13 — Equivalence of λ expressions. Two λ expressions are
equivalent if they can be made into the same expression by repeated
applications of the following rules:


1. Evaluation (aka β reduction): The expression (λx.exp)exp′ is
equivalent to exp[x → exp′].


2. Variable renaming (aka α conversion): The expression λx.exp
is equivalent to λy.exp[x → y].


If exp is a λ expression of the form λx.exp′ then it naturally
corresponds to the function that maps any input z to exp′[x → z]. Hence
the λ calculus naturally implies a computational model. Since in the λ
calculus the inputs can themselves be functions, we need to decide in
what order we evaluate an expression such as


(λx.f)(λy.g z) .    (8.4)


There are two natural conventions for this: 


• Call by name (aka “lazy evaluation”): We evaluate (8.4) by first
plugging in the right-hand expression (λy.g z) as input to the left-hand
side function, obtaining f[x → (λy.g z)] and then continue from
there.


• Call by value (aka “eager evaluation”): We evaluate (8.4) by first
evaluating the right-hand side and obtaining h = g[y → z], and then
plugging this into the left-hand side to obtain f[x → h].
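The difference between the two conventions can be imitated in Python, which is itself eager. In the sketch below (all names are hypothetical), call by name is simulated by passing a zero-argument function, often called a thunk, that is evaluated only if the callee asks for it:

```python
calls = []  # records each time the argument expression is actually evaluated

def argument():
    calls.append("evaluated")
    return 42

def const7_eager(value):
    # Call by value: `value` was already computed before this body runs.
    return 7

def const7_lazy(thunk):
    # Call by name (simulated): `thunk` is a zero-argument function,
    # and this body never calls it.
    return 7

eager_result = const7_eager(argument())
calls_after_eager = len(calls)

lazy_result = const7_lazy(lambda: argument())
calls_after_lazy = len(calls)

print(eager_result, lazy_result)            # both strategies agree on the result
print(calls_after_eager, calls_after_lazy)  # the lazy call added no evaluation
```

Both calls return the same value, but only the eager one evaluates the argument. If the argument were an expression whose evaluation never ends, only the lazy call would return.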


Because the λ calculus has only pure functions that do not have
“side effects”, in many cases the order does not matter. In fact, it can
be shown that if we obtain a definite irreducible expression (for
example, a number) in both strategies, then it will be the same one.




However, for concreteness we will always use the “call by name” (i.e.,
lazy evaluation) order. (The same choice is made in the programming
language Haskell, though many other programming languages use
eager evaluation.) Formally, the evaluation of a λ expression using
“call by name” is captured by the following process:


Definition 8.14 — Simplification of λ expressions. Let e be a λ
expression. The simplification of e is the result of the following recursive
process:


1. If e is a single variable x then the simplification of e is e.


2. If e has the form e = λx.e′ then the simplification of e is λx.f′
where f′ is the simplification of e′.


3. (Evaluation / β reduction.) If e has the form e = (λx.e′ e″) then
the simplification of e is the simplification of e′[x → e″], which
denotes replacing all copies of x in e′ bound to the λ operator
with e″.


4. (Renaming / α conversion.) The canonical simplification of e is
obtained by taking the simplification of e and renaming the
variables so that the first bound variable in the expression is v_0, the
second one is v_1, and so on and so forth.


We say that two λ expressions e and e′ are equivalent, denoted by
e ≅ e′, if they have the same canonical simplification.
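To make this process concrete, here is a minimal Python sketch of a simplifier. The tuple representation of terms and every function name are assumptions of the sketch, and so are two details the definition leaves implicit: an application whose left side is not yet an abstraction is first reduced at its head, and bound variables are renamed during substitution so that they do not capture free variables. It is tried on the two expressions of Solved Exercise 8.2.

```python
import itertools

# Terms are nested tuples (a representation chosen for this sketch):
#   ("var", name) | ("lam", name, body) | ("app", function, argument)
_fresh = itertools.count()

def free_vars(t):
    if t[0] == "var":
        return {t[1]}
    if t[0] == "lam":
        return free_vars(t[2]) - {t[1]}
    return free_vars(t[1]) | free_vars(t[2])

def subst(t, x, s):
    """Return t[x -> s], renaming bound variables to avoid capture."""
    if t[0] == "var":
        return s if t[1] == x else t
    if t[0] == "app":
        return ("app", subst(t[1], x, s), subst(t[2], x, s))
    _, y, body = t
    if y == x:                    # x is rebound here, so nothing to replace
        return t
    if y in free_vars(s):         # rename y so it cannot capture a variable of s
        z = f"_t{next(_fresh)}"
        body, y = subst(body, y, ("var", z)), z
    return ("lam", y, subst(body, x, s))

def head_reduce(t):
    """Beta-reduce at the head, leaving arguments unevaluated (call by name)."""
    while t[0] == "app":
        f = head_reduce(t[1])
        if f[0] != "lam":
            return ("app", f, t[2])
        t = subst(f[2], f[1], t[2])
    return t

def simplify(t):
    """Simplify t; like the process in the text, this may never terminate."""
    t = head_reduce(t)
    if t[0] == "var":
        return t
    if t[0] == "lam":
        return ("lam", t[1], simplify(t[2]))
    return ("app", simplify(t[1]), simplify(t[2]))

def canonical(t):
    """Simplify, then rename bound variables to v0, v1, ... in order."""
    counter = itertools.count()
    def rename(u, env):
        if u[0] == "var":
            return ("var", env.get(u[1], u[1]))
        if u[0] == "lam":
            new = f"v{next(counter)}"
            return ("lam", new, rename(u[2], {**env, u[1]: new}))
        return ("app", rename(u[1], env), rename(u[2], env))
    return rename(simplify(t), {})

# The two expressions of Solved Exercise 8.2.
e = ("lam", "x", ("var", "x"))
f = ("app", ("lam", "a", ("lam", "b", ("var", "b"))),
            ("lam", "z", ("app", ("var", "z"), ("var", "z"))))
print(canonical(e) == canonical(f))
```

Two terms are then compared by comparing their canonical forms, which is the equivalence test of the definition. On a term whose simplification never ends, simplify will not return (in practice Python raises RecursionError or loops).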


Solved Exercise 8.2 — Equivalence of λ expressions. Prove that the following
two expressions e and f are equivalent:


e = λx.x


f = (λa.(λb.b))(λz.zz)


Solution:

The canonical simplification of e is simply λv_0.v_0. To do the
canonical simplification of f we first use β reduction to plug in
λz.zz instead of a in (λb.b), but since a is not used in this function at
all, we simply obtain λb.b, which simplifies to λv_0.v_0 as well.
8.5.4 Infinite loops in the λ calculus

Like Turing machines and NAND-TM programs, the simplification
process in the λ calculus can also enter into an infinite loop. For
example, consider the λ expression


(λx.xx) (λx.xx)    (8.5)


If we try to simplify (8.5) by invoking the left-hand function on the
right-hand one, then we get another copy of (8.5) and hence this never
ends. There are examples where the order of evaluation can matter for
whether or not an expression can be simplified; see Exercise 8.9.


8.6 THE “ENHANCED” λ CALCULUS


We now discuss the λ calculus as a computational model. We will
start by describing an “enhanced” version of the λ calculus that
contains some “superfluous features” but is easier to wrap your head
around. We will first show how the enhanced λ calculus is
equivalent to Turing machines in computational power. Then we will show
how all the features of “enhanced λ calculus” can be implemented as
“syntactic sugar” on top of the “pure” (i.e., non-enhanced) λ calculus.
Hence the pure λ calculus is equivalent in power to Turing machines
(and hence also to RAM machines and all other Turing-equivalent
models).

The enhanced λ calculus includes the following set of objects and
operations:


• Boolean constants and IF function: There are λ expressions 0, 1
and IF that satisfy the following conditions: for all λ expressions
e and f, IF 1 e f = e and IF 0 e f = f. That is, IF is the function that
given three arguments a, e, f outputs e if a = 1 and f if a = 0.


• Pairs: There is a λ expression PAIR which we will think of as the
pairing function. For all λ expressions e and f, PAIR e f is the
pair ⟨e, f⟩ that contains e as its first member and f as its second
member. We also have λ expressions HEAD and TAIL that extract
the first and second member of a pair respectively. Hence, for all
λ expressions e and f, HEAD (PAIR e f) = e and TAIL (PAIR e f) = f.2

2 In Lisp, the PAIR, HEAD and TAIL functions are
traditionally called cons, car and cdr.


• Lists and strings: There is a λ expression NIL that corresponds to
the empty list, which we also denote by ⟨⊥⟩. Using PAIR and NIL
we construct lists. The idea is that if L is a k element list of the
form ⟨e_1, e_2, ..., e_k, ⊥⟩ then for every λ expression e_0 we can obtain
the k + 1 element list ⟨e_0, e_1, e_2, ..., e_k, ⊥⟩ using the expression
PAIR e_0 L. For example, for every three λ expressions e, f, g, the
following corresponds to the three element list ⟨e, f, g, ⊥⟩:


PAIR e (PAIR f (PAIR g NIL)) .


The λ expression ISEMPTY returns 1 on NIL and returns 0 on every
other list. A string is simply a list of bits.


• List operations: The enhanced λ calculus also contains the
list-processing functions MAP, REDUCE, and FILTER. Given
a list L = ⟨x_0, ..., x_{n−1}, ⊥⟩ and a function f, MAP L f
applies f on every member of the list to obtain the new list
L′ = ⟨f(x_0), ..., f(x_{n−1}), ⊥⟩. Given a list L as above and an
expression f whose output is either 0 or 1, FILTER L f returns the
list (x_i)_{f(x_i)=1} containing all the elements of L for which f outputs
1. The function REDUCE applies a “combining” operation to a
list. For example, REDUCE L + 0 will return the sum of all the
elements in the list L. More generally, REDUCE takes a list L, an
operation f (which we think of as taking two arguments) and a λ
expression z (which we think of as the “neutral element” for the
operation f, such as 0 for addition and 1 for multiplication). The
output is defined via


REDUCE L f z = { z                                       if L = NIL
               { f (HEAD L) (REDUCE (TAIL L) f z)        otherwise


See Fig. 8.16 for an illustration of the three list-processing operations.


• Recursion: Finally, we want to be able to execute recursive
functions. Since in λ calculus functions are anonymous, we can’t write
a definition of the form f(x) = blah where blah includes calls to
f. Instead we use functions f that take an additional input me as a
parameter. The operator RECURSE will take such a function f as
input and return a “recursive version” of f where all the calls to me
are replaced by recursive calls to this function. That is, if we have a
function F taking two parameters me and x, then RECURSE F will
be the function f taking one parameter x such that f(x) = F(f, x)
for every x.
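The list operations above can be mimicked in Python. The sketch below models a pair as a Python tuple and NIL as None; these are assumptions chosen for readability and are not how the pure λ calculus represents these objects.

```python
NIL = None  # stand-in for the empty list

def PAIR(head, tail):
    return (head, tail)

def HEAD(p):
    return p[0]

def TAIL(p):
    return p[1]

def ISEMPTY(L):
    return 1 if L is NIL else 0

def REDUCE(L, f, z):
    # Follows the case definition of REDUCE in the text.
    return z if ISEMPTY(L) else f(HEAD(L), REDUCE(TAIL(L), f, z))

def MAP(L, f):
    return NIL if ISEMPTY(L) else PAIR(f(HEAD(L)), MAP(TAIL(L), f))

def FILTER(L, f):
    if ISEMPTY(L):
        return NIL
    rest = FILTER(TAIL(L), f)
    return PAIR(HEAD(L), rest) if f(HEAD(L)) == 1 else rest

# The three element list <3, 4, 5>, built from the tail up.
L = PAIR(3, PAIR(4, PAIR(5, NIL)))

total = REDUCE(L, lambda a, b: a + b, 0)   # sum of the elements
doubled = MAP(L, lambda a: 2 * a)          # every element doubled
odds = FILTER(L, lambda a: a % 2)          # keep elements on which f outputs 1
print(total, doubled, odds)
```

Here Python's own recursion plays the role that RECURSE plays in the enhanced λ calculus.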


Solved Exercise 8.3 — Compute NAND using λ calculus. Give a λ expression
N such that N x y = NAND(x, y) for every x, y ∈ {0,1}.


Solution:
The NAND of x, y is equal to 1 unless x = y = 1. Hence we can
write


N = λx,y.IF(x, IF(y, 0, 1), 1)
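As a quick sanity check, the same expression can be transcribed into Python, with IF modeled as an ordinary Python function on the bits 0 and 1 (an assumption of this sketch):

```python
def IF(cond, a, b):
    # Outputs a if cond is 1 and b if cond is 0.
    return a if cond == 1 else b

# Transcription of N = λx,y.IF(x, IF(y, 0, 1), 1)
N = lambda x, y: IF(x, IF(y, 0, 1), 1)

table = {(x, y): N(x, y) for x in (0, 1) for y in (0, 1)}
print(table)
```

The resulting table can be compared against the truth table of NAND.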
Solved Exercise 8.4 — Compute XOR using λ calculus. Give a λ expression
XOR such that for every list L = ⟨x_0, ..., x_{n−1}, ⊥⟩ where x_i ∈ {0,1} for
i ∈ [n], XOR L evaluates to Σ x_i mod 2.


Solution:
First, we note that we can compute XOR of two bits as follows:

NOT = λa.IF(a, 0, 1)    (8.6)

and

XOR_2 = λa,b.IF(b, NOT(a), a)    (8.7)


(We are using here a bit of syntactic sugar to describe the
functions. To obtain the λ expression for XOR_2 we simply substitute
the expression (8.6) into (8.7).) Now recursively we can define the
XOR of a list as follows:


XOR(L) = { 0                                   if L is empty
         { XOR_2(HEAD(L), XOR(TAIL(L)))        otherwise


This means that XOR is equal to


RECURSE (λme,L.IF(ISEMPTY(L), 0, XOR_2(HEAD L, me(TAIL L)))) .


That is, XOR is obtained by applying the RECURSE operator
to the function that on inputs me, L, returns 0 if ISEMPTY(L) and
otherwise returns XOR_2 applied to HEAD(L) and to me(TAIL(L)).
We could have also computed XOR using the REDUCE
operation; we leave working this out as an exercise to the reader.


Figure 8.15: A list (x_0, x_1, x_2) in the λ calculus is
constructed from the tail up, building the pair (x_2, NIL),
then the pair (x_1, (x_2, NIL)) and finally the pair
(x_0, (x_1, (x_2, NIL))). That is, a list is a pair where
the first element of the pair is the first element of the
list and the second element is the rest of the list. The
figure on the left renders this “pairs inside pairs”
construction, though it is often easier to think of a list
as a “chain”, as in the figure on the right, where the
second element of each pair is thought of as a link,
pointer or reference to the remainder of the list.
8.6.1 Computing a function in the enhanced λ calculus

An enhanced λ expression is obtained by composing the objects above
with the application and abstraction rules. The result of simplifying a λ
expression is an equivalent expression, and hence if two expressions
have the same simplification then they are equivalent.


Definition 8.15 — Computing a function via λ calculus. Let
F : {0,1}* → {0,1}*. We say that exp computes F if for every x ∈ {0,1}*,


exp ⟨x_0, ..., x_{n−1}, ⊥⟩ ≅ ⟨y_0, ..., y_{m−1}, ⊥⟩


where n = |x|, y = F(x), and m = |y|, and the notion of
equivalence is defined as per Definition 8.14.


8.6.2 Enhanced λ calculus is Turing-complete

The basic operations of the enhanced λ calculus more or less amount
to the Lisp or Scheme programming languages. Given that, it is
perhaps not surprising that the enhanced λ calculus is equivalent to
Turing machines:


Theorem 8.16 — Lambda calculus and NAND-TM. For every function
F : {0,1}* → {0,1}*, F is computable in the enhanced λ calculus if
and only if it is computable by a Turing machine.


Proof Idea:

To prove the theorem, we need to show that (1) if F is computable
by a λ calculus expression then it is computable by a Turing machine,
and (2) if F is computable by a Turing machine, then it is computable
by an enhanced λ calculus expression.

Showing (1) is fairly straightforward. Applying the simplification
rules to a λ expression basically amounts to “search and replace”
which we can implement easily in, say, NAND-RAM, or for that
matter Python (both of which are equivalent to Turing machines in
power). Showing (2) essentially amounts to simulating a Turing
machine (or writing a NAND-TM interpreter) in a functional
programming language such as LISP or Scheme. We give the details below, but
working out how this can be done is a good exercise in mastering some
functional programming techniques that are useful in their own right.

* 


Proof of Theorem 8.16. We only sketch the proof. The direction showing
that every function computable in the enhanced λ calculus is computable
by a Turing machine is simple. As mentioned above, evaluating λ
expressions basically amounts to “search and replace”. It is also a fairly
straightforward programming exercise to implement all the above basic
operations in an imperative language such as Python or C, and using the
same ideas we can do so in NAND-RAM as well, which we can then transform
to a NAND-TM program.

For the other direction we need to simulate a Turing machine
using a λ expression. We will do so by first showing for every
Turing machine M a λ expression to compute the next-step function
NEXT_M : Σ̄* → Σ̄* that maps a configuration of M to the next one (see
Section 8.4.2).

A configuration of M is a string α ∈ Σ̄* for a finite set Σ̄. We can
encode every symbol σ ∈ Σ̄ by a finite string in {0,1}^ℓ, and so we will
encode a configuration α in the λ calculus as a list ⟨α_0, α_1, ..., α_{m−1}, ⊥⟩
where α_i is an ℓ-length string (i.e., an ℓ-length list of 0’s and 1’s)
encoding a symbol in Σ̄.

By Lemma 8.9, for every α ∈ Σ̄*, NEXT_M(α)_i is equal to
r(α_{i−1}, α_i, α_{i+1}) for some finite function r : Σ̄^3 → Σ̄. Using our
encoding of Σ̄ as {0,1}^ℓ, we can also think of r as mapping {0,1}^{3ℓ} to
{0,1}^ℓ. By Solved Exercise 8.3, we can compute the NAND function,
and hence every finite function, including r, using the λ calculus.
Using this insight, we can compute NEXT_M using the λ calculus as
follows. Given a list L encoding the configuration α_0 ⋯ α_{m−1}, we
define the lists L_prev and L_next encoding the configuration α shifted
by one step to the right and left respectively. The next configuration
α′ is defined as α′_i = r(L_prev[i], L[i], L_next[i]) where we let L′[i] denote
the i-th element of L′. This can be computed by recursion (and hence
using the enhanced λ calculus’ RECURSE operator) as follows:
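One way to picture this recursion is the following minimal Python sketch, which uses ordinary Python lists in place of λ-calculus lists. The names next_config, compute_next and blank are assumptions made for illustration, the treatment of positions past either end of the list (filled with blank) is a choice of the sketch, and the toy rules passed as r are not the transition rule of any real machine.

```python
def next_config(config, r, blank):
    """Apply the local rule r to every window (previous, current, next) symbol."""
    prev = [blank] + config   # configuration shifted one step to the right
    nxt = config[1:]          # configuration shifted one step to the left

    def compute_next(prev, cur, nxt):
        # Recursion over the three lists in lockstep.
        if not prev:
            return []
        a = prev[0]
        b = cur[0] if cur else blank
        c = nxt[0] if nxt else blank
        return [r(a, b, c)] + compute_next(prev[1:], cur[1:], nxt[1:])

    return compute_next(prev, config, nxt)

# Toy rule 1: keep the current symbol (the configuration is unchanged,
# apart from one extra blank cell at the end).
same = next_config(list("abc"), lambda a, b, c: b, "_")

# Toy rule 2: copy the symbol on the left (everything moves one cell right).
shifted = next_config(list("abc"), lambda a, b, c: a, "_")

print(same, shifted)
```

Each output symbol depends only on three neighboring input symbols, which is the locality that Lemma 8.9 provides.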
Once we can compute NEXT_M, we can simulate the execution of
M on input x using the following recursion. Define FINAL(α) to be
the final configuration of M when initialized at configuration α. The
function FINAL can be defined recursively as follows:


FINAL(α) = { α                      if α is a halting configuration
           { FINAL(NEXT_M(α))       otherwise

Checking whether a configuration is halting (i.e., whether it is
one in which the transition function would output Halt) can be easily
implemented in the λ calculus, and hence we can use the RECURSE
operator to compute FINAL. If we let α^0 be the initial configuration of M on
input x then we can obtain the output M(x) from FINAL(α^0), hence
completing the proof.
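The recursion defining FINAL has the same shape for any next-step function. The Python sketch below runs it on a toy "machine" whose configuration is a pair (n, steps) following the 3n + 1 rule; the toy machine and the parameter names next_step and is_halting are assumptions for illustration only.

```python
def FINAL(config, next_step, is_halting):
    # Direct transcription of the two cases in the definition of FINAL.
    if is_halting(config):
        return config
    return FINAL(next_step(config), next_step, is_halting)

# Toy next-step function: halve n if it is even, otherwise map it to 3n + 1,
# and count the number of steps taken so far.
def toy_next(config):
    n, steps = config
    return (n // 2 if n % 2 == 0 else 3 * n + 1, steps + 1)

def toy_halting(config):
    return config[0] == 1

result = FINAL((6, 0), toy_next, toy_halting)
print(result)
```

As with a Turing machine, nothing guarantees in general that the recursion ends: FINAL returns only if a halting configuration is eventually reached.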




8.7 FROM ENHANCED TO PURE λ CALCULUS


While the collection of “basic” functions we allowed for the enhanced
λ calculus is smaller than what's provided by most Lisp dialects,
coming from NAND-TM it still seems a little “bloated”. Can we make do
with less? In other words, can we find a subset of these basic
operations that can implement the rest?

It turns out that there is in fact a proper subset of the operations of
the enhanced λ calculus that can be used to implement the rest. That
subset is the empty set. That is, we can implement all the operations
above using the λ formalism only, even without using 0’s and 1’s. It’s
λ's all the way down!


Theorem 8.18 — Enhanced λ calculus equivalent to pure λ calculus. There
are λ expressions that implement the functions 0, 1, IF, PAIR, HEAD,
TAIL, NIL, ISEMPTY, MAP, REDUCE, and RECURSE.


The idea behind Theorem 8.18 is that we encode 0 and 1
themselves as λ expressions, and build things up from there. This is known
as Church encoding, as it was originated by Church in his effort to
show that the λ calculus can be a basis for all computation. We will
not write the full formal proof of Theorem 8.18 but outline the ideas
involved in it:


• We define 0 to be the function that on two inputs x, y outputs y,
and 1 to be the function that on two inputs x, y outputs x. We use
Currying to achieve the effect of two-input functions and hence
0 = λx.λy.y and 1 = λx.λy.x. (This representation scheme is the
common convention for representing false and true but there are
many other alternative representations for 0 and 1 that would have
worked just as well.)


• The above implementation makes the IF function trivial:
IF(cond, a, b) is simply cond a b since 0 a b = b and 1 a b = a. We
can write IF = λx.x to achieve IF(cond, a, b) = (((IF cond) a) b) =
cond a b.


• To encode a pair (x, y) we will produce a function f_{x,y} that has x
and y “in its belly” and satisfies f_{x,y} g = g x y for every function g.
That is, PAIR = λx,y.(λg.g x y). We can extract the first element of
a pair p by writing p 1 and the second element by writing p 0, and so
HEAD = λp.p 1 and TAIL = λp.p 0.


• We define NIL to be the function that ignores its input and always
outputs 1. That is, NIL = λx.1. The ISEMPTY function checks,
given an input p, whether we get 1 if we apply p to the function
zero = λx,y.0 that ignores both its inputs and always outputs 0. For
every valid pair of the form p = PAIR x y, p zero = zero x y = 0 while
NIL zero = 1. Formally, ISEMPTY = λp.p(λx,y.0).
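These encodings can be tried out with Python's own lambda, written in curried form. The sketch below is only an illustration: the names ZERO, ONE and zero are chosen here, and the helper to_bit, which converts a Church-encoded bit back to a Python integer for display, is not part of the λ calculus.

```python
ZERO = lambda x: lambda y: y                 # 0 = λx.λy.y
ONE = lambda x: lambda y: x                  # 1 = λx.λy.x
IF = lambda cond: cond                       # IF = λx.x, so IF cond a b = cond a b

PAIR = lambda x: lambda y: lambda g: g(x)(y)  # PAIR = λx,y.(λg.g x y)
HEAD = lambda p: p(ONE)                       # HEAD = λp.p 1
TAIL = lambda p: p(ZERO)                      # TAIL = λp.p 0

NIL = lambda x: ONE                           # NIL = λx.1
zero = lambda x: lambda y: ZERO               # zero = λx,y.0
ISEMPTY = lambda p: p(zero)                   # ISEMPTY = λp.p(λx,y.0)

def to_bit(b):
    # Helper for display only: a Church bit selects between 1 and 0.
    return b(1)(0)

p = PAIR(ONE)(ZERO)
print(to_bit(HEAD(p)), to_bit(TAIL(p)))
print(to_bit(ISEMPTY(NIL)), to_bit(ISEMPTY(PAIR(ONE)(NIL))))
print(to_bit(IF(ZERO)(ONE)(ZERO)))
```

Python evaluates both branches of IF before selecting one, which is harmless for these constants but matters once the branches contain recursive calls.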




8.7.1 List processing 
Now we come to a bigger hurdle, which is how to implement
MAP, FILTER, REDUCE and RECURSE in the pure λ calculus. It
turns out that we can build MAP and FILTER from REDUCE, and
REDUCE from RECURSE. For example MAP(L, f) is the same as
REDUCE(L, g, NIL) where g is the operation that on input x and y,
outputs PAIR(f(x), y). (I leave checking this as a (recommended!)
exercise for you, the reader.)
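A numerical spot check of the MAP claim (not a proof) can be done in Python. The sketch again models pairs as tuples and NIL as None, and the names MAP_direct and MAP_via_reduce are hypothetical:

```python
NIL = None
PAIR = lambda head, tail: (head, tail)

def REDUCE(L, f, z):
    return z if L is NIL else f(L[0], REDUCE(L[1], f, z))

def MAP_direct(L, f):
    # MAP defined by its own recursion, for comparison.
    return NIL if L is NIL else PAIR(f(L[0]), MAP_direct(L[1], f))

def MAP_via_reduce(L, f):
    # g takes an element x and an already-mapped rest y.
    g = lambda x, y: PAIR(f(x), y)
    return REDUCE(L, g, NIL)

L = PAIR(1, PAIR(2, PAIR(3, NIL)))
square = lambda a: a * a
print(MAP_via_reduce(L, square) == MAP_direct(L, square))
```

The check covers one list and one function only; the general statement still needs the inductive argument the text recommends.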

We can define REDUCE(L, f, z) recursively, by setting
REDUCE(NIL, f, z) = z and stipulating that given a
non-empty list L, which we can think of as a pair (head, rest),
REDUCE(L, f, z) = f(head, REDUCE(rest, f, z)). Thus, we
might try to write a recursive λ expression for REDUCE as follows


REDUCE = λL,f,z.IF(ISEMPTY(L), z, f HEAD(L) REDUCE(TAIL(L), f, z)) .    (8.8)

The only fly in this ointment is that the λ calculus does not have the
notion of recursion, and so this is an invalid definition. But of course
we can use our RECURSE operator to solve this problem. We will
replace the recursive call to “REDUCE” with a call to a function me
that is given as an extra argument, and then apply RECURSE to this.
Thus REDUCE = RECURSE myREDUCE where


myREDUCE = λme,L,f,z.IF(ISEMPTY(L), z, f HEAD(L) me(TAIL(L), f, z)) .    (8.9)


8.7.2 The Y combinator, or recursion without recursion

Eq. (8.9) means that implementing MAP, FILTER, and REDUCE boils
down to implementing the RECURSE operator in the pure λ calculus.
This is what we do now.

How can we implement recursion without recursion? We will
illustrate this using a simple example: the XOR function. As shown in
Solved Exercise 8.4, we can write the XOR function of a list recursively
as follows:


XOR(L) = { 0                                   if L is empty
         { XOR_2(HEAD(L), XOR(TAIL(L)))        otherwise


where XOR_2 : {0,1}^2 → {0,1} is the XOR on two bits. In Python we
would write this as


def xor2(a,b): return 1-b if a else b
def head(L): return L[0]
def tail(L): return L[1:]


def xor(L): return xor2(head(L),xor(tail(L))) if L else 0


print(xor([0,1,1,0,0,1]))
# 1


Now, how could we eliminate this recursive call? The main idea is
that since functions can take other functions as input, it is perfectly
legal in Python (and the λ calculus of course) to give a function itself
as input. So, our idea is to try to come up with a non-recursive function
tempxor that takes two inputs: a function and a list, and such that
tempxor(tempxor,L) will output the XOR of L!

Our first attempt might be to simply use the idea of replacing the
recursive call by me. Let’s define this function as myxor


def myxor(me,L): return xor2(head(L),me(tail(L))) if L else 0


Let’s test this out:
myxor(myxor,[1,0,1])


If you do this, you will get the following complaint from the
interpreter:

TypeError: myxor() missing 1 required positional argument

The problem is that myxor expects two inputs (a function and a
list) while in the call to me we only provided a list. To correct this, we
modify the call to also provide the function itself:


def tempxor(me,L): return xor2(head(L),me(me,tail(L))) if L else 0


Note the call me(me,..) in the definition of tempxor: given a
function me as input, tempxor will actually call the function me with itself
as the first input. If we test this out now, we see that we actually get
the right result!


tempxor(tempxor,[1,0,1])
# 0

tempxor(tempxor,[1,0,1,1])
# 1


and so we can define xor(L) as simply return
tempxor(tempxor,L).

The approach above is not specific to XOR. Given a recursive func- 
tion f that takes an input x, we can obtain a non-recursive version as 
follows: 


1. Create the function myf that takes a pair of inputs me and x, and 
replaces recursive calls to f with calls to me. 


2. Create the function tempf that converts calls in myf of the form 
me(x) to calls of the form me(me, x). 


3. The function f(x) will be defined as tempf(tempf, x).


Here is the way we implement the RECURSE operator in Python. It
will take a function myf as above, and replace it with a function g such
that g(x)=myf(g,x) for every x.


def RECURSE(myf): 
def tempf(me,x): return myf (lambda y: me(me,y),x) 


return lambda x: tempf(tempf, x) 


xor = RECURSE(myxor) 


print(xor([0,1,1,0,0,1]))
# 1


print(xor([1,1,0,0,1,1,1,1])) 
# 0 


From Python to the λ calculus. In the λ calculus, a two input function
g that takes a pair of inputs me, y is written as λme.(λy.g). So the
function y ↦ me(me,y) is simply written as me me and similarly
the function x ↦ tempf(tempf,x) is simply tempf tempf. (Can
you see why?) Therefore the function tempf defined above can be
written as λme.myf(me me). This means that if we denote the input
of RECURSE by f, then RECURSE myf = tempf tempf where tempf =
λm.f(m m) or in other words


RECURSE = λf.((λm.f(m m)) (λm.f(m m)))
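This expression can be written with Python lambdas alone, with one adjustment that is an assumption of the sketch below. Python evaluates arguments eagerly, so a literal transcription using m(m) would keep expanding without end; wrapping it as lambda y: m(m)(y) delays the self-application until an argument arrives. The curried function myxor used here is a hypothetical restatement of the earlier example.

```python
# RECURSE = λf.((λm.f(m m)) (λm.f(m m))), with m(m) wrapped in a lambda
# so that Python's eager evaluation does not expand it forever.
RECURSE = lambda f: (lambda m: f(lambda y: m(m)(y)))(
    lambda m: f(lambda y: m(m)(y))
)

# Curried version of myxor: takes `me`, then the list L.
# L[0] ^ ... is the XOR of two bits; L[1:] is the tail of the list.
myxor = lambda me: lambda L: (L[0] ^ me(L[1:])) if L else 0

xor = RECURSE(myxor)

print(xor([1, 0, 1]))
print(xor([0, 1, 1, 0, 0, 1]))
```

No function in this sketch refers to itself by name; the only self-reference is the application m(m) inside RECURSE.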


The online appendix contains an implementation of the λ
calculus using Python. Here is an implementation of the recursive XOR
function from that appendix:3


# XOR of two bits
XOR2 = λ(a,b)(IF(a,IF(b,_0,_1),b))


# Recursive XOR with recursive calls replaced by m parameter
myXOR = λ(m,l)(IF(ISEMPTY(l),_0,XOR2(HEAD(l),m(TAIL(l)))))


# Recurse operator (aka Y combinator)
RECURSE = λf((λm(f(m*m)))(λm(f(m*m))))


# XOR function
XOR = RECURSE(myXOR)


#TESTING:


XOR(PAIR(_1,NIL)) # List [1]
# equals 1


XOR(PAIR(_1,PAIR(_0,PAIR(_1,NIL)))) # List [1,0,1]
# equals 0


3 Because of specific issues of Python syntax, in this
implementation we use f * g for applying f to g
rather than fg, and use λx(exp) rather than λx.exp
for abstraction. We also use _0 and _1 for the λ terms
for 0 and 1 so as not to confuse with the Python
constants.

8.8 THE CHURCH-TURING THESIS (DISCUSSION) 


“[In 1934], Church had been speculating, and finally definitely proposed, that
the λ-definable functions are all the effectively calculable functions .... When
Church proposed this thesis, I sat down to disprove it ... but, quickly realizing 
that [my approach failed], I became overnight a supporter of the thesis.”, 
Stephen Kleene, 1979. 


“[The thesis is] not so much a definition or to an axiom but ... a natural law.”,
Emil Post, 1936. 


We have defined functions to be computable if they can be computed
by a NAND-TM program, and we've seen that the definition would
remain the same if we replaced NAND-TM programs by Python
programs, Turing machines, λ calculus, cellular automata, and many
other computational models. The Church-Turing thesis is that this is
the only sensible definition of “computable” functions. Unlike the
“Physical Extended Church-Turing Thesis” (PECTT) which we saw
before, the Church-Turing thesis does not make a concrete physical
prediction that can be experimentally tested, but it certainly motivates
predictions such as the PECTT. One can think of the Church-Turing
Thesis as either advocating a definitional choice, making some
prediction about all potential computing devices, or suggesting some
laws of nature that constrain the natural world. In Scott Aaronson’s
words, “whatever it is, the Church-Turing thesis can only be regarded
as extremely successful”. No candidate computing device (including
quantum computers, and also much less reasonable models such as
the hypothetical “closed time curve” computers we mentioned before)
has so far mounted a serious challenge to the Church-Turing thesis.
These devices might potentially make some computations more
efficient, but they do not change the difference between what is finitely
computable and what is not. (The extended Church-Turing thesis,
which we discuss in Section 13.3, stipulates that Turing machines
capture also the limit of what can be efficiently computable. Just like its
physical version, quantum computing presents the main challenge to
this thesis.)


8.8.1 Different models of computation 
We can summarize the models we have seen in the following table: 


Table 8.1: Different models for computing finite functions and
functions with arbitrary input length.


Computational problems     Type of model                Examples

Finite functions           Non-uniform computation      Boolean circuits, NAND circuits,
f : {0,1}^n → {0,1}^m      (algorithm depends on        straight-line programs
                           input length)                (e.g., NAND-CIRC)

Functions with             Sequential access            Turing machines,
unbounded inputs           to memory                    NAND-TM programs
F : {0,1}* → {0,1}*

–                          Indexed access / RAM         RAM machines, NAND-RAM,
                                                        modern programming languages

–                          Other                        Lambda calculus,
                                                        cellular automata

Later on in Chapter 17 we will study memory bounded
computation. It turns out that NAND-TM programs with a constant amount
of memory are equivalent to the model of finite automata (the
adjectives “deterministic” or “non-deterministic” are sometimes added as
well; this model is also known as finite state machines) which in turn
captures the notion of regular languages (those that can be described by
regular expressions), which is a concept we will see in Chapter 10.
Chapter Recap


• While we defined computable functions using
Turing machines, we could just as well have done
so using many other models, including not just
NAND-TM programs but also RAM machines,
NAND-RAM, the λ calculus, cellular automata and
many other models.

• Very simple models turn out to be “Turing
complete” in the sense that they can simulate arbitrarily
complex computation.


8.9 EXERCISES 


Exercise 8.1 — Alternative proof for TM/RAM equivalence. Let SEARCH :
{0,1}* → {0,1}* be the following function. The input is a pair
(L, k) where k ∈ {0,1}*, L is an encoding of a list of key value pairs
(k_0, v_0), ..., (k_{m−1}, v_{m−1}) where k_0, ..., k_{m−1}, v_0, ..., v_{m−1} are binary
strings. The output is v_i for the smallest i such that k_i = k, if such i
exists, and otherwise the empty string.


1. Prove that SEARCH is computable by a Turing machine. 


2. Let UPDATE(L, k, v) be the function whose input is a list L of pairs, 
and whose output is the list L’ obtained by prepending the pair 
(k, v) to the beginning of L. Prove that UPDATE is computable by a 
Turing machine. 


3. Suppose we encode the configuration of a NAND-RAM program 
by a list L of key/value pairs where the key is either the name of 
a scalar variable foo or of the form Bar[<num>] for some num- 
ber <num> and it contains all the non-zero values of variables. Let 
NEXT(L) be the function that maps a configuration of a NAND- 
RAM program at one step to the configuration in the next step. 
Prove that NEXT is computable by a Turing machine (you don’t 
have to implement each one of the arithmetic operations: it is 
enough to implement addition and multiplication). 


4. Prove that for every F : {0,1}* → {0,1}* that is computable by a
NAND-RAM program, F is computable by a Turing machine.


Exercise 8.2 — NAND-TM lookup. This exercise shows part of the proof that
NAND-TM can simulate NAND-RAM. Produce the code of a NAND-TM
program that computes the function LOOKUP : {0,1}* → {0,1}
that is defined as follows. On input pf(i)x, where pf(i) denotes a
prefix-free encoding of an integer i, LOOKUP(pf(i)x) = x_i if i < |x|
and LOOKUP(pf(i)x) = 0 otherwise. (We don’t care what LOOKUP
outputs on inputs that are not of this form.) You can choose any
prefix-free encoding you like, and can also use your favorite
programming language to produce this code.


Exercise 8.3 — Pairing. Let embed : N^2 → N be the function defined as
embed(x_0, x_1) = ½(x_0 + x_1)(x_0 + x_1 + 1) + x_1.


1. Prove that for every x^0, x^1 ∈ N, embed(x^0, x^1) is indeed a natural
number.


2. Prove that embed is one-to-one.


3. Construct a NAND-TM program P such that for every x^0, x^1 ∈ N,
P(pf(x^0)pf(x^1)) = pf(embed(x^0, x^1)), where pf is the prefix-free
encoding map defined above. You can use the syntactic sugar for
inner loops, conditionals, and incrementing/decrementing the
counter.


4. Construct NAND-TM programs P_0, P_1 such that for every x^0, x^1 ∈
N and i ∈ {0,1}, P_i(pf(embed(x^0, x^1))) = pf(x^i). You can use the
syntactic sugar for inner loops, conditionals, and
incrementing/decrementing the counter.


Exercise 8.4 — Shortest Path. Let SHORTPATH : {0,1}* → {0,1}*
be the function that on input a string encoding a triple (G, u, v)
outputs a string encoding ∞ if u and v are disconnected in G or a string
encoding the length k of the shortest path from u to v. Prove that
SHORTPATH is computable by a Turing machine. See footnote for
hint.4


Exercise 8.5 — Longest Path. Let LONGPATH : {0,1}* → {0,1}* be
the function that on input a string encoding a triple (G, u, v) outputs
a string encoding ∞ if u and v are disconnected in G or a string
encoding the length k of the longest simple path from u to v. Prove that
LONGPATH is computable by a Turing machine. See footnote for
hint.5


Exercise 8.6 — Shortest path λ expression. Let SHORTPATH be as in
Exercise 8.4. Prove that there exists a λ expression that computes
SHORTPATH. You can use Exercise 8.4.


Exercise 8.7 — Next-step function is local. Prove Lemma 8.9 and use it to
complete the proof of Theorem 8.7.


4 You don’t have to give a full description of a Turing
machine: use our “have the cake and eat it too”
paradigm to show the existence of such a machine by
arguing from more powerful equivalent models.


5 Same hint as Exercise 8.4 applies. Note that for
showing that LONGPATH is computable you don’t
have to give an efficient algorithm.
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a 
Exercise 8.8 — A calculus requires at most three variables. Prove that for ev- 
ery A-expression e with no free variables there is an equivalent A- 
expression f that only uses the variables x,y, and z.° 

a 


Exercise 8.9 — Evaluation order example in A calculus. 1. Lete = 
Au.7 ((Ax.xx)(Ax.xx)). Prove that the simplification process of e 
ends in a definite number if we use the “call by name” evaluation 
order while it never ends if we use the “call by value” order. 


2. (bonus, challenging) Let e be any λ expression. Prove that if the
simplification process ends in a definite number when we use the “call
by value” order, then it also ends in such a number when we use the
“call by name” order. See footnote for hint.⁷
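The contrast in part 1 can be imitated in Python, which evaluates arguments eagerly (call by value). The sketch below is only an analogy and not a proof. The names self_apply, const7, call_by_value, call_by_name and diverges are ours, and call by name is simulated by passing a zero-argument function (a "thunk") that is evaluated only if the body actually uses it.

```python
# (λx.xx): applies its argument to itself.
self_apply = lambda x: x(x)

# (λu.7): ignores its argument and returns the number 7.
const7 = lambda u: 7

def call_by_value():
    # Python evaluates the argument first, so it keeps unfolding
    # (λx.xx)(λx.xx) until the interpreter raises RecursionError.
    return const7(self_apply(self_apply))

def call_by_name():
    # The argument is wrapped in a thunk that is never called,
    # because the body of (λu.7) does not use u.
    return const7(lambda: self_apply(self_apply))

def diverges(f):
    """Return True if calling f() exceeds Python's recursion limit."""
    try:
        f()
    except RecursionError:
        return True
    return False
```

Here call_by_name() returns 7, while diverges(call_by_value) reports that the eager evaluation never settles.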


Exercise 8.10 — Zip function. Give an enhanced λ calculus expression to
compute the function zip that on input a pair of lists J and L of the
same length n, outputs a list of n pairs M such that the j-th element
of M (which we denote by M_j) is the pair (J_j, L_j). Thus zip “zips
together” these two lists of elements into a single list of pairs.⁸


Exercise 8.11 — Next-step function without RECURSE. Let M be a Turing
machine. Give an enhanced λ calculus expression to compute the
next-step function NEXT_M of M (as in the proof of Theorem 8.16)
without using RECURSE. See footnote for hint.⁹


Exercise 8.12 — λ calculus to NAND-TM compiler (challenging). Give a program
in the programming language of your choice that takes as input a λ
expression e and outputs a NAND-TM program P that computes the
same function as e. For partial credit you can use the GOTO and all
NAND-CIRC syntactic sugar in your output program. You can use
any encoding of λ expressions as binary strings that is convenient for
you. See footnote for hint.¹⁰


Exercise 8.13 — At least two in λ calculus. Let 1 = λx,y.x and 0 = λx,y.y as
before. Define

ALT = λa,b,c.(a (b 1 (c 1 0)) (b c 0))

Prove that ALT is a λ expression that computes the at least two function.
That is, for every a, b, c ∈ {0,1} (as encoded above) ALT a b c = 1
if and only if at least two of {a, b, c} are equal to 1.
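Before attempting the proof, it can help to check the claim on all eight inputs. The following Python sketch does this with curried lambdas standing in for the λ expressions. The helper names encode, decode and check_alt are ours, and an exhaustive check of this kind is a sanity test rather than the requested proof.

```python
from itertools import product

# Encodings of the bits: 1 selects its first argument, 0 its second.
ONE = lambda x: lambda y: x
ZERO = lambda x: lambda y: y

# ALT = λa,b,c.(a (b 1 (c 1 0)) (b c 0)), written in curried form.
ALT = lambda a: lambda b: lambda c: a(b(ONE)(c(ONE)(ZERO)))(b(c)(ZERO))

def encode(n):
    """Map the Python integer 0 or 1 to its encoding."""
    return ONE if n == 1 else ZERO

def decode(bit):
    """Map an encoded bit back to the Python integer 0 or 1."""
    return bit(1)(0)

def check_alt():
    """Compare ALT with 'at least two ones' on all eight inputs."""
    return all(
        decode(ALT(encode(a))(encode(b))(encode(c))) == int(a + b + c >= 2)
        for a, b, c in product([0, 1], repeat=3)
    )
```

Reading the expression as "if a then (b OR c) else (b AND c)" explains why check_alt() succeeds.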


6 Hint: You can reduce the number of variables a
function takes by “pairing them up”. That is, define a
λ expression PAIR such that for every x, y PAIR x y is
some function f such that f 0 = x and f 1 = y. Then
use PAIR to iteratively reduce the number of variables
used.


7 Use structural induction on the expression e. 


8 The name zip is a common name for this operation, 
for example in Python. It should not be confused with 
the zip compression file format. 


9 Use MAP and REDUCE (and potentially FILTER).
You might also find the function zip of Exercise 8.10 
useful. 


10 Try to set up a procedure such that if array Left
contains an encoding of a λ expression λx.e and
array Right contains an encoding of another λ expression
e′, then the array Result will contain e[x → e′].
Exercise 8.14 — Locality of next-step function. This question will help you
get a better sense of the notion of locality of the next step function of Turing
machines. This locality plays an important role in results such as
the Turing completeness of λ calculus and one dimensional cellular
automata, as well as results such as Gödel’s Incompleteness Theorem
and the Cook-Levin theorem that we will see later in this course. Define
STRINGS to be the programming language that has the following
semantics:


• A STRINGS program Q has a single string variable str that is both
the input and the output of Q. The program has no loops and no
other variables, but rather consists of a sequence of conditional
search and replace operations that modify str.


• The operations of a STRINGS program are:


  - REPLACE(pattern1,pattern2) where pattern1 and pattern2
    are fixed strings. This replaces the first occurrence of pattern1
    in str with pattern2.

  - if search(pattern) { code } executes code if pattern is a
    substring of str. The code code can itself include nested if’s.
    (One can also add an else { ... } to execute if pattern is not
    a substring of str).

  - the returned value is str.


• A STRINGS program Q computes a function F : {0,1}* → {0,1}* if
for every x ∈ {0,1}*, if we initialize str to x and then execute the
sequence of instructions in Q, then at the end of the execution str
equals F(x).


For example, the following is a STRINGS program that computes
the function F : {0,1}* → {0,1}* such that for every x ∈ {0,1}*, if x
contains a substring of the form y = 11ab11 where a, b ∈ {0,1}, then
F(x) = x′ where x′ is obtained by replacing the first occurrence of y in
x with 00.


if search('110011') {
    replace('110011', '00')
} else if search('110111') {
    replace('110111', '00')
} else if search('111011') {
    replace('111011', '00')
} else if search('111111') {
    replace('111111', '00')
}
Prove that for every Turing machine program M, there exists a
STRINGS program Q that computes the NEXT_M function that maps
every string encoding a valid configuration of M to the string encoding
the configuration of the next step of M’s computation. (We don’t
care what the function will do on strings that do not encode a valid
configuration.) You don’t have to write the STRINGS program fully,
but you do need to give a convincing argument that such a program
exists.
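To get a feel for the semantics, the example STRINGS program given earlier in this exercise can be mimicked in Python. This is a sketch under our own naming (example_strings_program is hypothetical). The for loop is only shorthand for the fixed chain of if/else tests, since STRINGS itself has no loops.

```python
def example_strings_program(s: str) -> str:
    """Python rendering of the four-branch example STRINGS program."""
    for pattern in ("110011", "110111", "111011", "111111"):
        if pattern in s:                        # if search(pattern) { ... }
            return s.replace(pattern, "00", 1)  # replace the first occurrence
    return s                                    # no branch fired: str unchanged

print(example_strings_program("0110011"))       # prints 000
```

Each branch inspects and rewrites only a window of six symbols, which is the kind of locality the exercise asks you to exploit for NEXT_M.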


8.10 BIBLIOGRAPHICAL NOTES 


Chapter 7 in the wonderful book of Moore and Mertens [MM11]
contains a great exposition of much of this material.

The RAM model can be very useful in studying the concrete com- 
plexity of practical algorithms. Its theoretical study was initiated in 
[CR73]. However, the exact set of operations that are allowed in the 
RAM model and their costs vary between texts and contexts. One 
needs to be careful in making such definitions, especially if the word 
size grows, as was already shown by Shamir [Sha79]. Chapter 3 in 
Savage’s book [Sav98] contains a more formal description of RAM 
machines, see also the paper [Hag98]. A study of RAM algorithms 
that are independent of the input size (known as the “transdichotomous
RAM model”) was initiated by [FW93].

The models of computation we considered so far are inherently 
sequential, but these days much computation happens in parallel, 
whether using multi-core processors or in massively parallel dis- 
tributed computation in data centers or over the Internet. Parallel 
computing is important in practice, but it does not really make much 
difference for the question of what can and can’t be computed. After 
all, if a computation can be performed using m machines in t time,
then it can be computed by a single machine in time mt. 

The λ-calculus was described by Church in [Chu41]. Pierce’s book
[Pie02] is a canonical textbook, see also [Bar84]. The “Currying technique”
is named after the logician Haskell Curry (the Haskell programming
language is named after Haskell Curry as well). Curry
himself attributed this concept to Moses Schönfinkel, though for some
reason the term “Schönfinkeling” never caught on.

Unlike most programming languages, the pure λ-calculus doesn’t
have the notion of types. Every object in the λ calculus can also be
thought of as a λ expression and hence as a function that takes one
input and returns one output. All functions take one input and return
one output, and if you feed a function an input of a form it didn’t
expect, it still evaluates the λ expression via “search and replace”,
replacing all instances of its parameter with copies of the input expression
you fed it. Typed variants of the λ calculus are objects of intense
research, and are strongly related to type systems for programming
languages and computer-verifiable proof systems, see [Pie02]. Some of
the typed variants of the λ calculus do not have infinite loops, which
makes them very useful as ways of enabling static analysis of programs
as well as computer-verifiable proofs. We will come back to this
point in Chapter 10 and Chapter 22.

Tao has proposed showing the Turing completeness of fluid dy- 
namics (a “water computer”) as a way of settling the question of the 
behavior of the Navier-Stokes equations, see this popular article. 
Universality and uncomputability


“A function of a variable quantity is an analytic expression composed in any 
way whatsoever of the variable quantity and numbers or constant quantities.”,
Leonhard Euler, 1748. 


“The importance of the universal machine is clear. We do not need to have an 
infinity of different machines doing different jobs. ... The engineering problem 
of producing various machines for various jobs is replaced by the office work of 
‘programming’ the universal machine”, Alan Turing, 1948 


One of the most significant results we showed for Boolean circuits
(or equivalently, straight-line programs) is the notion of universality:
there is a single circuit that can evaluate all other circuits. However,
this result came with a significant caveat. To evaluate a circuit of s
gates, the universal circuit needed to use a number of gates larger
than s. It turns out that uniform models such as Turing machines or
NAND-TM programs allow us to “break out of this cycle” and obtain
a truly universal Turing machine U that can evaluate all other machines,
including machines that are more complex (e.g., more states) than U
itself. (Similarly, there is a Universal NAND-TM program U′ that can
evaluate all NAND-TM programs, including programs that have more
lines than U′.)

It is no exaggeration to say that the existence of such a universal 
program/machine underlies the information technology revolution 
that began in the latter half of the 20th century (and is still ongoing). 
Up to that point in history, people had produced various special-purpose
calculating devices such as the abacus, the slide rule, and
machines that compute various trigonometric series. But as Turing
(who was perhaps the one to see most clearly the ramifications of 
universality) observed, a general purpose computer is much more pow- 
erful. Once we build a device that can compute the single universal 
function, we have the ability, via software, to extend it to do arbitrary 
computations. For example, if we want to simulate a new Turing ma- 
chine M, we do not need to build a new physical machine, but rather 
can represent M as a string (i.e., using code) and then input M to the
universal machine U. 

Beyond the practical applications, the existence of a universal algo- 
rithm also has surprising theoretical ramifications, and in particular 
can be used to show the existence of uncomputable functions, upend- 
ing the intuitions of mathematicians over the centuries from Euler 
to Hilbert. In this chapter we will prove the existence of the univer- 
sal program, and also show its implications for uncomputability, see 
Fig. 9.1.


9.1 UNIVERSALITY OR A META-CIRCULAR EVALUATOR 


We start by proving the existence of a universal Turing machine. This is 
a single Turing machine U that can evaluate arbitrary Turing machines 
M on arbitrary inputs x, including machines M that can have more 
states and larger alphabet than U itself. In particular, U can even be 
used to evaluate itself! This notion of self reference will appear time and 
again in this book, and as we will see, leads to several counter-intuitive 
phenomena in computing. 


[Figure 9.1 (diagram) with boxes labeled: Universal Turing Machine; Uncomputable function; Halting Problem; Reductions]
Theorem 9.1 — Universal Turing Machine. There exists a Turing machine
U such that on every string M which represents a Turing machine,
and x ∈ {0,1}*, U(M, x) = M(x).

That is, if the machine M halts on x and outputs some y ∈
{0,1}* then U(M, x) = y, and if M does not halt on x (i.e.,
M(x) = ⊥) then U(M, x) = ⊥.


Proof Idea: 

Once you understand what the theorem says, it is not that hard to 
prove. The desired program U is an interpreter for Turing machines. 
That is, U gets a representation of the machine M (think of it as source 
code), and some input x, and needs to simulate the execution of M on
x.

Think of how you would code U in your favorite programming 
language. First, you would need to decide on some representation 
scheme for M. For example, you can use an array or a dictionary 
to encode M’s transition function. Then you would use some data 
structure, such as a list, to store the contents of M’s tape. Now you can 
simulate M step by step, updating the data structure as you go along. 
The interpreter will continue the simulation until the machine halts. 

Once you do that, translating this interpreter from your favorite 
programming language to a Turing machine can be done just as we 
have seen in Chapter 8. The end result is what’s known as a “meta- 
Figure 9.1: In this chapter we will show the existence
of a universal Turing machine and then use this to de- 
rive first the existence of some uncomputable function. 
We then use this to derive the uncomputability of 
Turing’s famous “halting problem” (i.e., the HALT 
function), from which a host of other uncomputabil- 
ity results follow. We also introduce reductions, which 
allow us to use the uncomputability of a function F to 
derive the uncomputability of a new function G. 


[Figure 9.2 (diagram): the machine U receives the string description of M together with the input x and outputs U(M, x) = M(x).]


Figure 9.2: A Universal Turing Machine is a single
Turing Machine U that, given the description (as a
string) of an arbitrary Turing machine M and an input x,
evaluates the output of M on x. In contrast to
the universal circuit depicted in Fig. 5.6, the machine
M can be much more complex (e.g., more states or
tape alphabet symbols) than U.
circular evaluator”: an interpreter for a programming language written in that
same language. This is a concept that has a long history in computer science
starting from the original universal Turing machine. See also Fig. 9.3.
* 


9.1.1 Proving the existence of a universal Turing Machine 

To prove (and even properly state) Theorem 9.1, we need to fix some 
representation for Turing machines as strings. One potential choice 
for such a representation is to use the equivalence between Turing 
machines and NAND-TM programs and hence represent a Turing 
machine M using the ASCII encoding of the source code of the corre- 
sponding NAND-TM program P. However, we will use a more direct 
encoding. 


Definition 9.2 — String representation of Turing Machine. Let M be a Turing
machine with k states and a size ℓ alphabet Σ = {σ_0, ..., σ_{ℓ−1}} (we
use the convention σ_0 = 0, σ_1 = 1, σ_2 = ∅, σ_3 = ▷). We represent
M as the triple (k, ℓ, T) where T is the table of values for δ_M:


T = (δ_M(0, σ_0), δ_M(0, σ_1), ..., δ_M(k − 1, σ_{ℓ−1})),


where each value δ_M(s, σ) is a triple (s′, σ′, d) with s′ ∈ [k],
σ′ ∈ [ℓ] and d a number in {0, 1, 2, 3} encoding one of {L, R, S, H}. Thus
such a machine M is encoded by a list of 2 + 3k·ℓ natural numbers.
The string representation of M is obtained by concatenating a
prefix free representation of all these integers. If a string α ∈ {0,1}*
does not represent a list of integers in the form above, then we treat
it as representing the trivial Turing machine with one state that
immediately halts on every input.
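As a concrete illustration of this definition, the following Python sketch flattens a machine into its list of 2 + 3k·ℓ integers and then into a binary string. Several choices here are our assumptions rather than part of the definition: the function names, representing symbols by their indices in [ℓ], using d = 3 for H, and the particular prefix-free code (double every bit of the binary expansion, then append 01).

```python
def to_int_list(k, ell, delta):
    """Flatten (k, ell, T); delta maps (state, symbol index) to (s', sigma', d)."""
    numbers = [k, ell]
    for s in range(k):
        for sigma in range(ell):
            numbers.extend(delta[(s, sigma)])
    return numbers

def prefix_free(n):
    """One possible prefix-free code: double each bit of n, then append '01'."""
    return "".join(2 * b for b in format(n, "b")) + "01"

def encode_machine(k, ell, delta):
    """Concatenate the prefix-free codes of all integers describing the machine."""
    return "".join(prefix_free(n) for n in to_int_list(k, ell, delta))

def decode_list(bits):
    """Invert encode_machine: read pairs of bits until each '01' terminator."""
    numbers, current = [], ""
    for i in range(0, len(bits), 2):
        pair = bits[i:i + 2]
        if pair == "01":
            numbers.append(int(current, 2))
            current = ""
        else:
            current += pair[0]
    return numbers

# One-state machine over four symbols that halts immediately on every symbol.
trivial = {(0, sigma): (0, sigma, 3) for sigma in range(4)}
numbers = to_int_list(1, 4, trivial)
bits = encode_machine(1, 4, trivial)
```

Because no codeword is a prefix of another, the concatenated string can be split back into the original integers without separators, which is what decode_list does.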


Using this representation, we can formally prove Theorem 9.1. 


Proof of Theorem 9.1. We will only sketch the proof, giving the major 
ideas. First, we observe that we can easily write a Python program 
that, on input a representation (k, ℓ, T) of a Turing machine M and
an input x, evaluates M on x. Here is the code of this program for
concreteness, though you can feel free to skip it if you are not familiar 
with (or interested in) Python: 


def EVAL(δ,x):
    '''Evaluate TM given by transition table δ
    on input x'''
    Tape = ["▷"] + [a for a in x]
    i = 0; s = 0 # i = head pos, s = state
    while True:
        # look up the transition and write the new symbol under the head
        s, Tape[i], d = δ[(s,Tape[i])]
        if d == "H": break
        if d == "L": i = max(i-1,0)
        if d == "R": i += 1
        if i >= len(Tape): Tape.append('∅') # extend the tape with a blank

    j = 1; Y = [] # produce output
    while j < len(Tape) and Tape[j] != '∅':
        Y.append(Tape[j])
        j += 1
    return Y


On input a transition table δ this program will simulate the corresponding
machine M step by step, at each point maintaining the
invariant that the array Tape contains the contents of M’s tape, and
the variable s contains M’s current state.
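To see the simulation at work, here is a small usage sketch. It repeats the interpreter (with the parameter renamed to delta) so that the block runs on its own, and it feeds it a hypothetical one-state machine of our own, flip, that inverts every input bit and halts at the first blank.

```python
def EVAL(delta, x):
    """Evaluate the Turing machine with transition table delta on input x."""
    Tape = ["▷"] + [a for a in x]
    i = 0; s = 0                       # i = head position, s = state
    while True:
        s, Tape[i], d = delta[(s, Tape[i])]
        if d == "H": break
        if d == "L": i = max(i - 1, 0)
        if d == "R": i += 1
        if i >= len(Tape): Tape.append("∅")
    j = 1; Y = []                      # collect output up to the first blank
    while j < len(Tape) and Tape[j] != "∅":
        Y.append(Tape[j])
        j += 1
    return Y

# Transition table: (state, symbol read) -> (new state, symbol written, move).
flip = {
    (0, "▷"): (0, "▷", "R"),
    (0, "0"): (0, "1", "R"),
    (0, "1"): (0, "0", "R"),
    (0, "∅"): (0, "∅", "H"),
}

print("".join(EVAL(flip, "0110")))     # prints 1001
```

The table flip plays the role of the "source code" of M, and the very same EVAL would run any other table handed to it.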

The above does not prove the theorem as stated, since we need 
to show a Turing machine that computes EVAL rather than a Python 
program. With enough effort, we can translate this Python code 
line by line to a Turing machine. However, to prove the theorem we 
don’t need to do this, but can use our “eat the cake and have it too” 
paradigm. That is, while we need to evaluate a Turing machine, in 
writing the code for the interpreter we are allowed to use a richer 
model such as NAND-RAM since it is equivalent in power to Turing 
machines per Theorem 8.1. 
Translating the above Python code to NAND-RAM is truly straightforward.
The only issue is that NAND-RAM doesn’t have the dictionary
data structure built in, which we have used above to store the
transition function δ. However, we can represent a dictionary D of
the form {key_0 : val_0, ..., key_{m−1} : val_{m−1}} as simply a list of pairs.
To compute D[k] we can scan over all the pairs until we find one of
the form (k, v) in which case we return v. Similarly we scan the list
to update the dictionary with a new value, either modifying it or appending
the pair (key, val) at the end.
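The list-of-pairs idea can be sketched in a few lines of Python. The function names lookup and update are ours, and returning None for a missing key is a convention chosen only for this sketch.

```python
def lookup(pairs, key):
    """Return the value stored under key by scanning the list of pairs."""
    for k, v in pairs:
        if k == key:
            return v
    return None  # key absent (convention of this sketch)

def update(pairs, key, val):
    """Overwrite the value of key if it is present, else append (key, val)."""
    for idx, (k, _) in enumerate(pairs):
        if k == key:
            pairs[idx] = (key, val)
            return
    pairs.append((key, val))

# Store two entries of a transition table and read one of them back.
D = []
update(D, (0, "0"), (0, "1", "R"))
update(D, (0, "1"), (0, "0", "R"))
print(lookup(D, (0, "1")))  # prints (0, '0', 'R')
```

Both operations take time linear in the number of pairs, which is slower than a hash table but irrelevant for the question of computability.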


The construction above yields a universal Turing machine with a
very large number of states. However, since universal Turing machines
have such a philosophical and technical importance, researchers have
attempted to find the smallest possible universal Turing machines, see
Section 9.7.


9.1.2 Implications of universality (discussion) 

There is more than one Turing machine U that satisfies the conditions
of Theorem 9.1, but the existence of even a single such machine
is already extremely fundamental to both the theory and practice of
computer science. Theorem 9.1’s impact reaches beyond the particular
model of Turing machines. Because we can simulate every Turing
machine by a NAND-TM program and vice versa, Theorem 9.1
immediately implies there exists a universal NAND-TM program P_U
such that P_U(P, x) = P(x) for every NAND-TM program P. We can
also “mix and match” models. For example since we can simulate
every NAND-RAM program by a Turing machine, and every Turing
machine by the λ calculus, Theorem 9.1 implies that there exists a λ
expression e such that for every NAND-RAM program P and input x
on which P(x) = y, if we encode (P, x) as a λ-expression f (using the



λ-calculus encoding of strings as lists of 0’s and 1’s) then (e f) evaluates
to an encoding of y. More generally we can say that for every
X and Y in the set { Turing machines, RAM Machines, NAND-TM,
NAND-RAM, λ-calculus, JavaScript, Python, ... } of Turing equivalent
models, there exists a program/machine in X that computes the map
(P, x) ↦ P(x) for every program/machine P ∈ Y.

The idea of a “universal program” is of course not limited to theory. 
For example compilers for programming languages are often used to 
compile themselves, as well as programs more complicated than the 
compiler. (An extreme example of this is Fabrice Bellard’s Obfuscated 
Tiny C Compiler which is a C program of 2048 bytes that can compile 
a large subset of the C programming language, and in particular can 
compile itself.) This is also related to the fact that it is possible to write 
a program that can print its own source code, see Fig. 9.3. There are 
universal Turing machines known that require a very small number 
of states or alphabet symbols, and in particular there is a universal 
Turing machine (with respect to a particular choice of representing 
Turing machines as strings) whose tape alphabet is {▷, ∅, 0, 1} and
has fewer than 25 states (see Section 9.7). 
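The remark about programs that print their own source code can be tried out in Python as well. The two lines below are a standard self-reproducing construction, offered as an illustration in Python and not as a translation of any program in the figure. The block deliberately contains no comments, since every extra character would also have to be reproduced by the program.

```python
s = 's = %r\nprint(s %% s)'
print(s % s)
```

The string s serves as both data and code: it is a template for the whole program, and formatting it with its own representation (the %r placeholder) fills in the one part that the template cannot spell out literally.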


9.2 IS EVERY FUNCTION COMPUTABLE? 


In Theorem 4.12, we saw that NAND-CIRC programs can compute
every finite function f : {0,1}^n → {0,1}. Therefore a natural guess is
that NAND-TM programs (or equivalently, Turing machines) could
compute every infinite function F : {0,1}* → {0,1}. However, this
turns out to be false. That is, there exists a function F : {0,1}* → {0,1}
that is uncomputable!
Figure 9.3: a) A particularly elegant example of a
“meta-circular evaluator” comes from John Mc- 
Carthy’s 1960 paper, where he defined the Lisp 
programming language and gave a Lisp function that 
evaluates an arbitrary Lisp program (see above). Lisp 
was not initially intended as a practical program- 
ming language and this example was merely meant 
as an illustration that the Lisp universal function is 
more elegant than the universal Turing machine. It 
was McCarthy’s graduate student Steve Russell who 
suggested that it can be implemented. As McCarthy 
later recalled, “I said to him, ho, ho, you're confusing 
theory with practice, this eval is intended for reading, not 
for computing. But he went ahead and did it. That is, he 
compiled the eval in my paper into IBM 704 machine code, 
fixing a bug, and then advertised this as a Lisp interpreter, 
which it certainly was”. b) A self-replicating C program 
from the classic essay of Thompson [Tho84]. 
The existence of uncomputable functions is quite surprising. Our
intuitive notion of a “function” (and the notion most mathematicians 
had until the 20th century) is that a function f defines some implicit 
or explicit way of computing the output f(x) from the input x. The 
notion of an “uncomputable function” thus seems to be a contradic- 
tion in terms, but yet the following theorem shows that such creatures 
do exist: 


Theorem 9.5 — Uncomputable functions. There exists a function F* :
{0,1}* → {0,1} that is not computable by any Turing machine.


Proof Idea: 

The idea behind the proof follows quite closely Cantor’s proof that
the reals are uncountable (Theorem 2.5), and in fact the theorem can
also be obtained fairly directly from that result (see Exercise 7.11).
However, it is instructive to see the direct proof. The idea is to construct
F* in a way that will ensure that every possible machine M will
in fact fail to compute F*. We do so by defining F*(x) to equal 0 if x
describes a Turing machine M which satisfies M(x) = 1 and defining
F*(x) = 1 otherwise. By construction, if M is any Turing machine and
x is the string describing it, then F*(x) ≠ M(x) and therefore M does
not compute F*.

* 


Proof of Theorem 9.5. The proof is illustrated in Fig. 9.4. We start by
defining the following function G : {0,1}* → {0,1}:

For every string x ∈ {0,1}*, if x satisfies (1) x is a valid representation
of some Turing machine M (per the representation scheme
above) and (2) when the program M is executed on the input x it
halts and produces an output, then we define G(x) as the first bit of
this output. Otherwise (i.e., if x is not a valid representation of a Turing
machine, or the machine M_x never halts on x) we define G(x) = 0.
We define F*(x) = 1 − G(x).

We claim that there is no Turing machine that computes F*. Indeed,
suppose, towards a contradiction, that there exists a machine
M that computes F*, and let x be the binary string that represents
the machine M. On one hand, since by our assumption M
computes F*, on input x the machine M halts and outputs F*(x). On
the other hand, by the definition of F*, since x is the representation
of the machine M, F*(x) = 1 − G(x) = 1 − M(x). Hence
M(x) = F*(x) = 1 − M(x), which is impossible for a bit, yielding a
contradiction.
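A finite analogue of this argument can be run in Python. In the sketch below (all names are ours) a short list of 0/1-valued functions stands in for the enumeration of all machines, and integer indices stand in for their string descriptions. Unlike real Turing machines, these toy functions always halt, so the role of G in handling non-halting runs does not arise.

```python
def diagonal_flip(machines):
    """Return a function that differs from machines[i] on input i, for every i."""
    def f_star(i):
        return 1 - machines[i](i)   # flip the i-th diagonal entry
    return f_star

# A toy "enumeration" of four total 0/1-valued functions.
machines = [
    lambda i: 0,
    lambda i: 1,
    lambda i: i % 2,
    lambda i: int(i >= 2),
]

f_star = diagonal_flip(machines)

# For each machine, the index of an input on which it disagrees with f_star:
# by construction, machine number i already disagrees on input i.
witnesses = [i for i, m in enumerate(machines) if m(i) != f_star(i)]
print(witnesses)   # prints [0, 1, 2, 3]
```

No function in the list computes f_star, for the same reason that no machine in the infinite enumeration computes F*.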


The type of argument used to prove Theorem 9.5 is known as diagonalization
since it can be described as defining a function based
on the diagonal entries of a table as in Fig. 9.4. The proof can be
thought of as an infinite version of the counting argument we used
for showing a lower bound for NAND-CIRC programs in Theorem 5.3.
Namely, we show that it’s not possible to compute all functions from
{0,1}* to {0,1} by Turing machines simply because there are more
functions like that than there are Turing machines.

As mentioned in Remark 7.4, many texts use the “language” terminology
and so will call a set L ⊆ {0,1}* an undecidable or non-recursive
language if the function F : {0,1}* → {0,1} such that
F(x) = 1 ⇔ x ∈ L is uncomputable.




Figure 9.4: We construct an uncomputable function
by defining for every two strings x, y the value
1 − M_y(x) which equals 0 if the machine described
by y outputs 1 on x, and 1 otherwise. We then define
F*(x) to be the “diagonal” of this table, namely
F*(x) = 1 − M_x(x) for every x. The function F*
is uncomputable, because if it was computable by
some machine whose string description is x* then we
would get that M_{x*}(x*) = F*(x*) = 1 − M_{x*}(x*).
9.3 THE HALTING PROBLEM


Theorem 9.5 shows that there is some function that cannot be com- 
puted. But is this function the equivalent of the “tree that falls in the 
forest with no one hearing it”? That is, perhaps it is a function that 
no one actually wants to compute. It turns out that there are natural 
uncomputable functions: 


Theorem 9.6 — Uncomputability of Halting function. Let HALT : {0,1}* →
{0,1} be the function such that for every string M ∈ {0,1}* and input x ∈ {0,1}*,
HALT(M, x) = 1 if Turing machine M halts on the input x and
HALT(M, x) = 0 otherwise. Then HALT is not computable.


Before turning to prove Theorem 9.6, we note that HALT is a very 
natural function to want to compute. For example, one can think of 
HALT as a special case of the task of managing an “App store”. That 
is, given the code of some application, the gatekeeper for the store 
needs to decide if this code is safe enough to allow in the store or not. 
At a minimum, it seems that we should verify that the code would not
go into an infinite loop. 


Proof Idea: 


One way to think about this proof is as follows: 
Uncomputability of F* + Universality = Uncomputability of HALT 


That is, we will use the universal Turing machine that computes EVAL 
to derive the uncomputability of HALT from the uncomputability of 
F* shown in Theorem 9.5. Specifically, the proof will be by contra- 
diction. That is, we will assume towards a contradiction that HALT is 
computable, and use that assumption, together with the universal Tur- 
ing machine of Theorem 9.1, to derive that F* is computable, which 
will contradict Theorem 9.5. 


* 


Proof of Theorem 9.6. The proof will use the previously established 
result Theorem 9.5. Recall that Theorem 9.5 shows that the following 
function F* : {0,1}* → {0,1} is uncomputable:


F*(x) = 1   if x(x) = 0 or x(x) = ⊥
F*(x) = 0   otherwise


where x(x) denotes the output of the Turing machine described by the 
string x on the input x (with the usual convention that x(x) = ⊥ if this
computation does not halt). 

We will show that the uncomputability of F* implies the uncom- 
putability of HALT. Specifically, we will assume, towards a contra- 
diction, that there exists a Turing machine M that can compute the 
HALT function, and use that to obtain a Turing machine M’ that com- 
putes the function F*. (This is known as a proof by reduction, since we 
reduce the task of computing F* to the task of computing HALT. By 
the contrapositive, this means the uncomputability of F* implies the 
uncomputability of HALT.) 

Indeed, suppose that M is a Turing machine that computes HALT.
Algorithm 9.7 describes a Turing machine M′ that computes F*. (We
use a “high level” description of Turing machines, appealing to the
“have your cake and eat it too” paradigm, see Big Idea 10.)


We claim that Algorithm 9.7 computes the function F*. Indeed,
suppose that x(x) = 0 (and hence F*(x) = 1). In this
case, HALT(x, x) = 1 and hence, under our assumption that
M(x, x) = HALT(x, x), the value z will equal 1, and hence Algorithm 9.7
will set y = x(x) = 0, and output the correct value 1.

Suppose otherwise that x(x) ≠ 0. In this case there are two possibilities:


• Case 1: The machine described by x does not halt on the input x
(and hence F*(x) = 1). In this case, HALT(x, x) = 0. Since we
assume that M computes HALT it means that on input x, x, the
machine M must halt and output the value 0. This means that
Algorithm 9.7 will set z = 0 and output 1.

• Case 2: The machine described by x halts on the input x and outputs
some y′ ≠ 0 (and hence F*(x) = 0). In this case, since
HALT(x, x) = 1, under our assumptions, Algorithm 9.7 will set
y = y′ ≠ 0 and so output 0.


We see that in all cases, M′(x) = F*(x), which contradicts the
fact that F* is uncomputable. Hence we reach a contradiction to our
original assumption that M computes HALT.
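As a rough illustration of the reduction analysed above, here is a Python sketch of how M′ could be assembled from a HALT solver and a universal evaluator. It follows the case analysis of the proof and is not claimed to be a verbatim rendering of Algorithm 9.7. The names build_m_prime, toy_halt, toy_universal and TOY_OUTPUT are ours, and the "machines" are a three-entry lookup table, so nothing here contradicts the theorem.

```python
def build_m_prime(halt, universal):
    """Return M' computing F*, given a hypothetical HALT solver and a universal evaluator.

    halt(m, x)      -> 1 if machine m halts on x, else 0 (assumed to exist)
    universal(m, x) -> output of m on x; only called when m halts on x
    """
    def m_prime(x):
        z = halt(x, x)
        if z == 0:
            return 1          # x(x) = ⊥, so F*(x) = 1
        y = universal(x, x)   # safe to simulate: the run is known to halt
        return 1 if y == 0 else 0
    return m_prime

# Toy stand-in world with three named machines; None means "never halts".
TOY_OUTPUT = {"loops": None, "says0": 0, "says1": 1}

def toy_halt(m, x):
    return 0 if TOY_OUTPUT[m] is None else 1

def toy_universal(m, x):
    assert TOY_OUTPUT[m] is not None, "simulation would run forever"
    return TOY_OUTPUT[m]

m_prime = build_m_prime(toy_halt, toy_universal)
```

The order of the two steps matters: the universal evaluator is invoked only after the HALT solver has confirmed that the simulation terminates, which is exactly why a computable HALT would make F* computable.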


9.3.1 Is the Halting problem really hard? (discussion) 


Many people’s first instinct when they see the proof of Theorem 9.6
is to not believe it. That is, most people do believe the mathematical
statement, but intuitively it doesn’t seem that the Halting problem is
really that hard. After all, being uncomputable only means that HALT
cannot be computed by a Turing machine.

But programmers seem to solve HALT all the time by informally or 
formally arguing that their programs halt. It’s true that their programs 
are written in C or Python, as opposed to Turing machines, but that 
makes no difference: we can easily translate back and forth between 
this model and any other programming language. 

While every programmer encounters at some point an infinite loop, 
is there really no way to solve the halting problem? Some people 
argue that they personally can, if they think hard enough, determine 
whether any concrete program that they are given will halt or not. 
Some have even argued that humans in general have the ability to 
do that, and hence humans have inherently superior intelligence to 
computers or anything else modeled by Turing machines.¹

¹ This argument has also been connected to the
issues of consciousness and free will. I am personally
skeptical of its relevance to these issues. Perhaps the
reasoning is that humans have the ability to solve the
halting problem but they exercise their free will and
consciousness by choosing not to do so.

The best answer we have so far is that there truly is no way to solve
HALT, whether using Macs, PCs, quantum computers, humans, or
any other combination of electronic, mechanical, and biological
devices. Indeed this assertion is the content of the Church-Turing Thesis.
This of course does not mean that for every possible program P, it
is hard to decide if P enters an infinite loop. Some programs don’t
even have loops at all (and hence trivially halt), and there are many
other far less trivial examples of programs that we can certify to never
enter an infinite loop (or programs that we know for sure will
enter such a loop). However, there is no general procedure that would
determine for an arbitrary program P whether it halts or not.
Moreover, there are some very simple programs for which no one knows
whether they halt or not. For example, the following Python program
will halt if and only if Goldbach’s conjecture is false:


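def isprime(p):
    # p is prime iff no i in 2..p-2 divides it (only called with p >= 2 here)
    return all(p % i for i in range(2,p-1))


def Goldbach(n):
    # True iff n = p + (n-p) for some pair of primes p and n-p
    return any( (isprime(p) and isprime(n-p))
                for p in range(2,n-1))


n = 4
while True:
    if not Goldbach(n): break
    n += 2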


Given that Goldbach’s Conjecture has been open since 1742, it is 
unclear that humans have any magical ability to say whether this (or 
other similar programs) will halt or not. 
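
To make the contrast concrete, here is a minimal sketch of a bounded variant of the program above. Because it only searches even numbers up to a fixed limit, it always halts, so it says nothing about whether the unbounded search halts. The names `goldbach` and `first_counterexample` and the square-root cutoff in `isprime` are choices made for this illustration and do not come from the text.

```python
def isprime(p):
    """Trial division up to sqrt(p); intended only for small p."""
    return p >= 2 and all(p % i for i in range(2, int(p ** 0.5) + 1))


def goldbach(n):
    """True if n can be written as a sum of two primes."""
    return any(isprime(p) and isprime(n - p) for p in range(2, n // 2 + 1))


def first_counterexample(limit):
    """Check every even n in [4, limit]; return a counterexample or None."""
    for n in range(4, limit + 1, 2):
        if not goldbach(n):
            return n
    return None


print(first_counterexample(1000))  # prints None
```

The bounded search is computable for the same reason that running a program for a fixed number of steps is: the loop has an explicit upper bound. Removing the bound gives back the original program, whose halting is equivalent to an open conjecture.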


[Figure 9.5: SMBC’s take on solving the Halting problem. Comic text:
“The Halting problem is easy to solve. If the program runs too long, I
take this stick and beat the computer until it stops.” Title: “What if
Alan Turing had been an engineer?”]


9.3.2 A direct proof of the uncomputability of HALT (optional)

It turns out that we can combine the ideas of the proofs of
Theorem 9.5 and Theorem 9.6 to obtain a short proof of the latter theorem
that does not appeal to the uncomputability of F*. This short proof
appeared in print in a 1965 letter to the editor by Christopher Strachey:


To the Editor, The Computer Journal.


An Impossible Program


Sir,


A well-known piece of folk-lore among programmers holds that it is 
impossible to write a program which can examine any other program 
and tell, in every case, if it will terminate or get into a closed loop when 
it is run. I have never actually seen a proof of this in print, and though 
Alan Turing once gave me a verbal proof (in a railway carriage on the 
way to a Conference at the NPL in 1953), I unfortunately and promptly 
forgot the details. This left me with an uneasy feeling that the proof 
must be long or complicated, but in fact it is so short and simple that it 
may be of interest to casual readers. The version below uses CPL, but 
not in any essential way. 

Suppose T[R] is a Boolean function taking a routine (or program) R
with no formal or free variables as its arguments and that for all R,
T[R] = True if R terminates if run and that T[R] = False if R does not
terminate.

Consider the routine P defined as follows


rec routine P
§L: if T[P] go to L
Return §


If T[P] = True the routine P will loop, and it will only terminate if
T[P] = False. In each case T[P] has exactly the wrong value, and this
contradiction shows that the function T cannot exist.


Yours faithfully, 
C. Strachey 


Churchill College, Cambridge 


Since CPL is not as common today, let us reproduce this proof. The 
idea is the following: suppose for the sake of contradiction that there 
exists a program T such that T(f, x) equals True iff f halts on input 
x. (Strachey’s letter considers the no-input variant of HALT, but as 
we'll see, this is an immaterial distinction.) Then we can construct a 
program P and an input x such that T(P, x) gives the wrong answer. 
The idea is that on input x, the program P will do the following: run 
T(x, x), and if the answer is True then go into an infinite loop, and 
otherwise halt. Now you can see that T(P,P) will give the wrong 
answer: if P halts when it gets its own code as input, then T(P,P) is 
supposed to be True, but then P(P) will go into an infinite loop. And 
if P does not halt, then T(P,P) is supposed to be False but then P(P) 
will halt. We can also code this up in Python: 


def CantSolveMe(T):
    """
    Gets function T that claims to solve HALT.
    Returns a pair (P,x) of code and input on which
    T(P,x) != HALT(P,x)
    """
    def fool(x):
        if T(x,x):
            while True: pass
        return "I halted"

    return (fool,fool)


For example, consider the following naive Python program T that
guesses that a given function does not halt if its source code contains
while or for:


def T(f,x):
    """Crude halting tester - decides it doesn't halt if it
    contains a loop."""
    import inspect
    source = inspect.getsource(f)
    # str.find returns -1 (which is truthy) when the substring is
    # absent, so we test membership with `in` instead
    if "while" in source: return False
    if "for" in source: return False
    return True


If we now set (f,x) = CantSolveMe(T), then T(f,x)=False but 
f(x) does in fact halt. This is of course not specific to this particular T: 
for every program T, if we run (f,x) = CantSolveMe(T) then we'll 
get an input on which T gives the wrong answer to HALT. 
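
As a small runnable illustration of this point, the sketch below applies CantSolveMe to a hypothetical tester `always_no` (a name invented here) that claims no program ever halts. Running f(x) is only safe because this tester answers False; for a tester that answers True on (f,x), the call f(x) would loop forever, which is exactly the other way of being wrong.

```python
def CantSolveMe(T):
    """Return a pair (P, x) on which the claimed halting tester T is wrong."""
    def fool(x):
        if T(x, x):
            while True:
                pass
        return "I halted"
    return (fool, fool)


def always_no(f, x):
    """Hypothetical tester: claims that nothing ever halts."""
    return False


f, x = CantSolveMe(always_no)
claimed = always_no(f, x)  # the tester's verdict on (f, x)
actual = f(x)              # safe here only because the verdict was False
print(claimed, actual)     # prints: False I halted
```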


9.4 REDUCTIONS 


The Halting problem turns out to be a linchpin of uncomputability, in 
the sense that Theorem 9.6 has been used to show the uncomputabil- 
ity of a great many interesting functions. We will see several examples 
of such results in this chapter and the exercises, but there are many 
more such results (see Fig. 9.6). 


[Figure 9.6 (diagram): a graph of reductions starting from the Halting
problem. Node labels recoverable from the figure: Halting Problem;
Rice’s Theorem; Semantic non-trivial functions; Turing Machine
Computes Parity Function; Context Free Grammar Equivalence;
λ Calculus Halting and Equivalence; Game of Life Halting; Quantified
Mixed Statements; Quantified Integer Statements; Diophantine
Equations (MRDP Theorem); Gödel’s Incompleteness Theorem.]


The idea behind such uncomputability results is conceptually sim- 
ple but can at first be quite confusing. If we know that HALT is un- 
computable, and we want to show that some other function BLAH is 
uncomputable, then we can do so via a contrapositive argument (i.e., 
proof by contradiction). That is, we show that if there exists a Turing 
machine that computes BLAH then there exists a Turing machine that 
computes HALT. (Indeed, this is exactly how we showed that HALT 
itself is uncomputable, by deriving this fact from the uncomputability 
of the function F* of Theorem 9.5.) 
Figure 9.6: Some uncomputability results. An arrow
from problem X to problem Y means that we use the 
uncomputability of X to prove the uncomputability 
of Y by reducing computing X to computing Y. 

All of these results except for the MRDP Theorem 
appear in either the text or exercises. The Halting 
Problem HALT serves as our starting point for all 
these uncomputability results as well as many others. 
For example, to prove that BLAH is uncomputable, we could show
that there is a computable function R : {0,1}* → {0,1}* such that for
every pair M and x, HALT(M,x) = BLAH(R(M,x)). The existence of
such a function R implies that if BLAH was computable then HALT 
would be computable as well, hence leading to a contradiction! The 
confusing part about reductions is that we are assuming something 
we believe is false (that BLAH has an algorithm) to derive something 
that we know is false (that HALT has an algorithm). Michael Sipser 
describes such results as having the form “If pigs could whistle then 
horses could fly”. 

A reduction-based proof has two components. For starters, since 
we need R to be computable, we should describe the algorithm to 
compute it. The algorithm to compute R is known as a reduction since 
the transformation R modifies an input to HALT to an input to BLAH, 
and hence reduces the task of computing HALT to the task of comput- 
ing BLAH. The second component of a reduction-based proof is the 
analysis of the algorithm R: namely a proof that R does indeed satisfy 
the desired properties. 
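
The shape of such an argument, B = A ∘ R, can be sketched in code. Since HALT and BLAH cannot be implemented, the example below uses two computable toy problems as stand-ins: it reduces "does a list contain a duplicate?" to "does a list contain two equal adjacent entries?" via the transformation R = sorting. All names here (`reduce_to`, `has_adjacent_equal`, `has_duplicates`) are hypothetical and chosen only for illustration.

```python
def reduce_to(R, A):
    """Given a computable transformation R and an algorithm A for the
    target problem, return the algorithm B(input) = A(R(input))."""
    def B(*inputs):
        return A(R(*inputs))
    return B


def has_adjacent_equal(xs):
    """Algorithm A for the target problem (plays the role of BLAH)."""
    return any(a == b for a, b in zip(xs, xs[1:]))


# The reduction R is `sorted`: a list has a duplicate iff its sorted
# version has two equal adjacent entries (this is the "analysis").
has_duplicates = reduce_to(sorted, has_adjacent_equal)

print(has_duplicates([3, 1, 3]), has_duplicates([1, 2, 3]))  # prints: True False
```

In an uncomputability proof the same composition is used, except that A is only hypothetical: the conclusion is that A cannot exist, because otherwise B would compute HALT.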

Reduction-based proofs are just like other proofs by contradiction, 
but the fact that they involve hypothetical algorithms that don’t really 
exist tends to make reductions quite confusing. The one silver lining 
is that at the end of the day the notion of reductions is mathematically 
quite simple, and so it’s not that bad even if you have to go back to
first principles every time you need to remember what is the direction
that a reduction should go in.

9.4.1 Example: Halting on the zero problem

Here is a concrete example of a proof by reduction. We define the
function HALTONZERO : {0,1}* → {0,1} as follows. Given any
string M, HALTONZERO(M) = 1 if and only if M describes a Turing
machine that halts when it is given the string 0 as input. A priori
HALTONZERO seems like a potentially easier function to compute 
than the full-fledged HALT function, and so we could perhaps hope 
that it is not uncomputable. Alas, the following theorem shows that 
this is not the case: 


Theorem 9.9 — Halting without input. HALTONZERO is uncomputable.


Figure 9.7: To prove Theorem 9.9, we show that
HALTONZERO is uncomputable by giving a reduction
from the task of computing HALT to the task of
computing HALTONZERO. This shows that if there was a
hypothetical algorithm A computing HALTONZERO,
then there would be an algorithm B computing
HALT, contradicting Theorem 9.6. Since neither A nor
B actually exists, this is an example of an implication
of the form “if pigs could whistle then horses could
fly”.

[Diagram: Algorithm B for HALT, using the hypothetical Algorithm A
for HALTONZERO. Input: TM M, string x. Operation: 1. Write code of
TM N_{M,x}: “Ignore input and run M(x)”. 2. Return A(N_{M,x}).]


Proof of Theorem 9.9. The proof is by reduction from HALT, see
Fig. 9.7. We will assume, for the sake of contradiction, that
HALTONZERO is computable by some algorithm A, and use this
hypothetical algorithm A to construct an algorithm B to compute
HALT, hence obtaining a contradiction to Theorem 9.6. (As discussed
in Big Idea 10, following our “have your cake and eat it too” paradigm,
we just use the generic name “algorithm” rather than worrying
whether we model them as Turing machines, NAND-TM programs,
NAND-RAM, etc.; this makes no difference since all these models are
equivalent to one another.)

Since this is our first proof by reduction from the Halting
problem, we will spell it out in more detail than usual. Such a proof by
reduction consists of two steps:


1. Description of the reduction: We will describe the operation of our 
algorithm B, and how it makes “function calls” to the hypothetical 
algorithm A. 


2. Analysis of the reduction: We will then prove that under the hypoth- 
esis that Algorithm A computes HALTONZERO, Algorithm B will 
compute HALT. 


Our Algorithm B works as follows: on input M, x, it runs
Algorithm 9.10 to obtain a Turing machine M’, and then returns A(M’).
The machine M’ ignores its input z and simply runs M on x.


In pseudocode, the program N_{M,x} will look something like the
following:


def N(z):
    M = r'.......'
    # a string constant containing the description of M
    x = r'.......'
    # a string constant containing x
    return eval(M,x)
    # note that we ignore the input z


That is, if we think of N_{M,x} as a program, then it is a program that
contains M and x as “hardwired constants”, and given any input z, it
simply ignores the input and always returns the result of evaluating
M on x. The algorithm B does not actually execute the machine N_{M,x}.
B merely writes down the description of N_{M,x} as a string (just as we
did above) and feeds this string as input to A.

The above completes the description of the reduction. The analysis is
obtained by proving the following claim:

Claim: For all strings M, x, z, the machine N_{M,x} constructed by
Algorithm B in Step 1 satisfies that N_{M,x} halts on z if and only if the
program described by M halts on the input x.

Proof of Claim: Since N_{M,x} ignores its input and evaluates M on x
using the universal Turing machine, it will halt on z if and only if M
halts on x.

In particular if we instantiate this claim with the input z = 0 to
N_{M,x} we see that HALTONZERO(N_{M,x}) = HALT(M, x). Thus if
the hypothetical algorithm A satisfies A(M) = HALTONZERO(M)
for every M then the algorithm B we construct satisfies B(M, x) =
HALT(M, x) for every M, x, contradicting the uncomputability of
HALT.


Figure 9.8: A Python implementation of the reduction 
showing that HALTONZERO is uncomputable if 
HALT is. See this Colab notebook for a full implemen- 
tation of the reduction. 


def B(P,x):
    """B will solve the Halting problem
    if A solves the HALTONZERO problem
    INPUT:
        P: source code of Python function
        x: input to P
    USES: Black box A(.)
    If for every Q, A(Q)=HALTONZERO(Q) then will return HALT(P,x)"""

    # extract name of function defined in P
    i,j = P.index("def"), P.index("(")
    func = P[i+3:j].strip()

    # Q ignores its input z and runs P's function on the hardwired x
    Q_x = f"def Q(z):\n    return {func}('{x}')\n"+P

    return A(Q_x)
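
Since the black box A of Figure 9.8 cannot exist, B itself cannot be run, but the string manipulation it performs can. The sketch below isolates that step in a helper `build_Q` (a name introduced here), under the assumption that P is the source of a single top-level Python function taking one string. It then executes the generated source only to inspect it; the reduction itself never runs Q.

```python
def build_Q(P, x):
    """Return source code of Q, which ignores its input z and calls the
    function defined in P on the hardwired input x."""
    i, j = P.index("def"), P.index("(")
    func = P[i + 3:j].strip()          # name of the function defined in P
    return f"def Q(z):\n    return {func}({x!r})\n" + P


# Hypothetical sample program P, used only for this demonstration.
P = "def shout(s):\n    return s.upper() + '!'\n"
Q_src = build_Q(P, "hi")

env = {}
exec(Q_src, env)                        # defines both Q and shout in env
print(env["Q"]("0"), env["Q"]("anything"))  # prints: HI! HI!
```

Whatever input Q receives, it behaves like P on x, so Q halts on 0 exactly when P halts on x.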
9.5 RICE’S THEOREM AND THE IMPOSSIBILITY OF GENERAL
SOFTWARE VERIFICATION


The uncomputability of the Halting problem turns out to be a special
case of a much more general phenomenon: namely, that we cannot
certify semantic properties of general purpose programs. “Semantic
properties” are properties of the function that the program computes, as
opposed to properties that depend on the particular syntax used by
the program.

An example for a semantic property of a program P is the property 
that whenever P is given an input string with an even number of 1’s, 
it outputs 0. Another example is the property that P will always halt 
whenever the input ends with a 1. In contrast, the property that a C 
program contains a comment before every function declaration is not 
a semantic property, since it depends on the actual source code as 
opposed to the input/output relation. 

Checking semantic properties of programs is of great interest, as it 
corresponds to checking whether a program conforms to a specifica- 
tion. Alas it turns out that such properties are in general uncomputable. 
We have already seen some examples of uncomputable semantic func- 
tions, namely HALT and HALTONZERO, but these are just the “tip of 
the iceberg”. We start by observing one more such example: 


Theorem 9.12 — Computing all zero function. Let ZEROFUNC : {0,1}* →
{0,1} be the function such that for every M ∈ {0,1}*, ZEROFUNC(M) =
1 if and only if M represents a Turing machine such that M outputs
0 on every input x ∈ {0,1}*. Then ZEROFUNC is uncomputable.


Proof of Theorem 9.12. The proof is by reduction from HALTONZERO.
Suppose, for the sake of contradiction, that there was an
algorithm A such that A(M) = ZEROFUNC(M) for every M ∈ {0,1}*.
Then we will construct an algorithm B that solves HALTONZERO,
contradicting Theorem 9.9.

Given a Turing machine N (which is the input to HALTONZERO),
our Algorithm B does the following:


1. Construct a Turing machine M which on input x ∈ {0,1}*, first
runs N(0) and then outputs 0.


2. Return A(M). 


Now if N halts on the input 0 then the Turing machine M com- 
putes the constant zero function, and hence under our assumption 
that A computes ZEROFUNC, A(M) = 1. If N does not halt on the 
input 0, then the Turing machine M will not halt on any input, and 
so in particular will not compute the constant zero function. Hence 
under our assumption that A computes ZEROFUNC, A(M) = 0. 
We see that in both cases, ZEROFUNC(M) = HALTONZERO(N) 
and hence the value that Algorithm B returns in step 2 is equal to 
HALTONZERO(N) which is what we needed to prove. 
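
Step 1 of this reduction is again just a transformation of program text. As a rough sketch, with Python source strings standing in for Turing machine descriptions, the hypothetical helper `build_M` below appends to the code of N a function M that first calls N on "0" and then returns 0. The parameter `n_name` (the name of the function defined in the source of N) is an assumption of this illustration.

```python
def build_M(N_src, n_name):
    """Return source of M: on any input x, first run N on '0', then output 0."""
    return N_src + f"\ndef M(x):\n    {n_name}('0')\n    return 0\n"


# Hypothetical sample machine N that halts on every input.
N_src = "def N(z):\n    return len(z)\n"
M_src = build_M(N_src, "N")

env = {}
exec(M_src, env)  # run the generated source only to inspect M's behavior
print([env["M"](x) for x in ["", "0", "101"]])  # prints: [0, 0, 0]
```

Because this N halts on "0", the resulting M outputs 0 on every input. If N looped on "0", the call inside M would never return, so M would not compute the constant zero function.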


Another result along similar lines is the following: 


Theorem 9.13 — Uncomputability of verifying parity. The following
function is uncomputable:

COMPUTES-PARITY(P) = 1 if P computes the parity function,
                     0 otherwise.


9.5.1 Rice’s Theorem 


Theorem 9.13 can be generalized far beyond the parity function. In 
fact, this generalization rules out verifying any type of semantic spec- 
ification on programs. We define a semantic specification on programs 
to be some property that does not depend on the code of the program 
but just on the function that the program computes. 

For example, consider the following two C programs:

int First(int n) {
    if (n<0) return 0;
    return 2*n;
}


int Second(int n) {
    int i = 0;
    int j = 0;
    if (n<0) return 0;
    while (j<n) {
        i = i + 2;
        j = j + 1;
    }
    return i;
}

First and Second are two distinct C programs, but they compute
the same function. A semantic property would be either true for both
programs or false for both programs, since it depends on the function
the programs compute and not on their code. An example of a
semantic property that both First and Second satisfy is the following:
“The program P computes a function f mapping integers to integers
satisfying that f(n) ≥ n for every input n”.
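
As an informal check, here is a Python port of the two programs (the lowercase names `first` and `second` are introduced for this sketch). It compares them on a finite range of inputs, which is evidence rather than a proof of functional equivalence, and it ignores the integer overflow that the C versions would have on large inputs.

```python
def first(n):
    if n < 0:
        return 0
    return 2 * n


def second(n):
    i = 0
    j = 0
    if n < 0:
        return 0
    while j < n:
        i = i + 2
        j = j + 1
    return i


sample = range(-50, 51)
print(all(first(n) == second(n) for n in sample))  # prints: True
print(all(first(n) >= n for n in sample))          # prints: True
```

A syntactic property, such as "the source contains the word while", distinguishes the two programs even though every semantic property must agree on them.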

A property is not semantic if it depends on the source code rather 
than the input/output behavior. For example, properties such as “the 
program contains the variable k” or “the program uses the while op- 
eration” are not semantic. Such properties can be true for one of the 
programs and false for others. Formally, we define semantic proper- 
ties as follows: 


Definition 9.14 — Semantic properties. A pair of Turing machines
M and M’ are functionally equivalent if for every x ∈ {0,1}*,
M(x) = M’(x). (In particular, M(x) = ⊥ iff M’(x) = ⊥ for all x.)

A function F : {0,1}* → {0,1} is semantic if for every pair of
strings M, M’ that represent functionally equivalent Turing
machines, F(M) = F(M’). (Recall that we assume that every string
represents some Turing machine, see Remark 9.3.)


There are two trivial examples of semantic functions: the constant
one function and the constant zero function. For example, if Z is the
constant zero function (i.e., Z(M) = 0 for every M) then clearly
Z(M) = Z(M’) for every pair of Turing machines M and M’ that are
functionally equivalent. Here is a non-trivial example:

Solved Exercise 9.1 — ZEROFUNC is semantic. Prove that the function
ZEROFUNC is semantic.


Solution: 

Recall that ZEROFUNC(M) = 1 if and only if M(x) = 0 for
every x ∈ {0,1}*. If M and M’ are functionally equivalent, then for
every x, M(x) = M’(x). Hence ZEROFUNC(M) = 1 if and only if
ZEROFUNC(M’) = 1.


Often the properties of programs that we are most interested in 
computing are the semantic ones, since we want to understand the 
programs’ functionality. Unfortunately, Rice’s Theorem tells us that 
these properties are all uncomputable: 


Theorem 9.15 — Rice’s Theorem. Let F : {0,1}* → {0,1}. If F is
semantic and non-trivial then it is uncomputable.


Proof Idea: 

The idea behind the proof is to show that every semantic
non-trivial function F is at least as hard to compute as HALTONZERO.
This will conclude the proof since by Theorem 9.9, HALTONZERO
is uncomputable. If a function F is non-trivial then there are two
machines M_0 and M_1 such that F(M_0) = 0 and F(M_1) = 1. So,
the goal would be to take a machine N and find a way to map it into
a machine M = R(N), such that (i) if N halts on zero then M is
functionally equivalent to M_1 and (ii) if N does not halt on zero then
M is functionally equivalent to M_0.

Because F is semantic, if we achieved this, then we would be
guaranteed that HALTONZERO(N) = F(R(N)), and hence would show
that if F was computable, then HALTONZERO would be computable
as well, contradicting Theorem 9.9.


Proof of Theorem 9.15. We will not give the proof in full formality, but
rather illustrate the proof idea by restricting our attention to a
particular semantic function F. However, the same techniques generalize to
all possible semantic functions. Define MONOTONE : {0,1}* → {0,1}
as follows: MONOTONE(M) = 1 if there does not exist n ∈ N and
two inputs x, x’ ∈ {0,1}^n such that for every i ∈ [n], x_i ≤ x’_i but M(x)
outputs 1 and M(x’) = 0. That is, MONOTONE(M) = 1 if it’s not
possible to find an input x such that flipping some bits of x from 0 to
1 will change M’s output in the other direction from 1 to 0. We will
prove that MONOTONE is uncomputable, but the proof will easily
generalize to any semantic non-trivial function.

We start by noting that MONOTONE is neither the constant zero 
nor the constant one function: 


• The machine INF that simply goes into an infinite loop on every
input satisfies MONOTONE(INF) = 1, since INF is not defined
anywhere and so in particular there are no two inputs x, x’ where
x_i ≤ x’_i for every i but INF(x) = 1 and INF(x’) = 0.


• The machine PAR that computes the XOR or parity of its input, is
not monotone (e.g., PAR(1,1,0,0,...,0) = 0 but PAR(1,0,0,...,0) =
1) and hence MONOTONE(PAR) = 0.


(Note that INF and PAR are machines and not functions.) 
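
For intuition about the definition, the following sketch brute-forces the monotonicity condition for total functions on inputs of one fixed length n. This is much weaker than MONOTONE itself, which quantifies over all input lengths and over machines that may not halt; the helper name `is_monotone` and the sample functions are assumptions of this illustration.

```python
from itertools import product


def is_monotone(f, n):
    """Return False iff there are x <= y (coordinate-wise) in {0,1}^n
    with f(x) == 1 and f(y) == 0."""
    for x in product((0, 1), repeat=n):
        for y in product((0, 1), repeat=n):
            below = all(a <= b for a, b in zip(x, y))
            if below and f(x) == 1 and f(y) == 0:
                return False
    return True


def par(x):
    """Parity (XOR) of a tuple of bits."""
    return sum(x) % 2


def or_(x):
    """OR of a tuple of bits, a monotone function."""
    return int(any(x))


print(is_monotone(par, 3), is_monotone(or_, 3))  # prints: False True
```

The witness for parity is the pair x = (1,0,0) and y = (1,1,0): flipping a bit from 0 to 1 changes the output from 1 to 0.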

We will now give a reduction from HALTONZERO to 
MONOTONE. That is, we assume towards a contradiction that 
there exists an algorithm A that computes MONOTONE and we will 
build an algorithm B that computes HALTONZERO. Our algorithm B 
will work as follows: 


Algorithm B: 


Input: String N describing a Turing machine. (Goal: Compute 
HALTONZERO(N)) 


Assumption: Access to Algorithm A to compute MONOTONE. 


Operation: 


1. Construct the following machine M: “On input z ∈ {0,1}* do: (a)
Run N(0), (b) Return PAR(z)”.


2. Return 1 − A(M).


To complete the proof we need to show that B outputs the
correct answer, under our assumption that A computes MONOTONE.
In other words, we need to show that HALTONZERO(N) = 1 −
MONOTONE(M). Suppose that N does not halt on zero. In this
case the program M constructed by Algorithm B enters into an
infinite loop in step (a) and will never reach step (b). Hence in this
case M is functionally equivalent to INF. (The machine M is not
the same machine as INF: its description or code is different. But it
does have the same input/output behavior (in this case) of never
halting on any input. Also, while the program M will go into an
infinite loop on every input, Algorithm B never actually runs M: it
only produces its code and feeds it to A. Hence Algorithm B will
not enter into an infinite loop even in this case.) Thus in this case,
MONOTONE(M) = MONOTONE(INF) = 1.


If N does halt on zero, then step (a) in M will eventually conclude
and M’s output will be determined by step (b), where it simply
outputs the parity of its input. Hence in this case, M computes the
non-monotone parity function (i.e., is functionally equivalent to PAR), and
so we get that MONOTONE(M) = MONOTONE(PAR) = 0. In both
cases, MONOTONE(M) = 1 − HALTONZERO(N), which is what
we wanted to prove.
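
The machine M built in step 1 of Algorithm B can be mimicked with Python callables, as in the sketch below. Here `build_M`, `par`, and `halts_quickly` are hypothetical names for this illustration, and strings of 0s and 1s play the role of inputs. We only exercise the case where N halts on "0", since in the other case calling M would never return.

```python
def build_M(N, M1):
    """Return M with M(z) = M1(z), but only after N has finished on '0'."""
    def M(z):
        N("0")        # step (a): the result is ignored, only halting matters
        return M1(z)  # step (b): behave like M1
    return M


def par(z):
    """Parity of the number of 1s in the string z."""
    return z.count("1") % 2


def halts_quickly(z):
    """A sample N that halts on every input."""
    return len(z)


M = build_M(halts_quickly, par)
print([M(z) for z in ["", "1", "11", "101", "111"]])  # prints: [0, 1, 0, 0, 1]
```

When N halts on "0", M agrees with par on every input, so any semantic function must assign M and par the same value.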

An examination of this proof shows that we did not use anything
about MONOTONE beyond the fact that it is semantic and non-trivial.
For every semantic non-trivial F, we can use the same proof, replacing
PAR and INF with two machines M_0 and M_1 such that F(M_0) = 0 and
F(M_1) = 1. Such machines must exist if F is non-trivial.


Remark 9.16 — Semantic is not the same as uncom- 
putable. Rice’s Theorem is so powerful and such a 
popular way of proving uncomputability that peo- 
ple sometimes get confused and think that it is the 
only way to prove uncomputability. In particular, a 
common misconception is that if a function F is not 
semantic then it is computable. This is not at all the 
case. 

For example, consider the following function
HALTNOYALE : {0,1}* → {0,1}. This is a function
that on input a string that represents a NAND-TM
program P, outputs 1 if and only if both (i) P halts
on the input 0, and (ii) the program P does not
contain a variable with the identifier Yale. The function
HALTNOYALE is clearly not semantic, as it will
output two different values when given as input one of
the following two functionally equivalent programs:


Yale[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Yale[0])


and


Harvard[0] = NAND(X[0],X[0])
Y[0] = NAND(X[0],Harvard[0])


However, HALTNOYALE is uncomputable since every
program P can be transformed into an equivalent
(and in fact improved :)) program P’ that does not
contain the variable Yale. Hence if we could compute
HALTNOYALE then we could determine halting on zero for
NAND-TM programs (and hence for Turing machines
as well).

Moreover, as we will see in Chapter 11, there are un- 
computable functions whose inputs are not programs, 
and hence for which the adjective “semantic” is not 
applicable. 
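
The transformation from P to P’ used in this remark is a simple textual rewrite. A minimal sketch, assuming programs are given as plain strings and that identifiers are delimited by non-word characters, is shown below; the helper `rename_yale` and its `fresh` parameter are hypothetical names for this illustration.

```python
import re


def rename_yale(program, fresh="Harvard"):
    """Rename the identifier Yale to `fresh`, which must not already occur."""
    if re.search(rf"\b{re.escape(fresh)}\b", program):
        raise ValueError("replacement identifier already used in program")
    return re.sub(r"\bYale\b", fresh, program)


P = "Yale[0] = NAND(X[0],X[0])\nY[0] = NAND(X[0],Yale[0])"
print(rename_yale(P))
```

Since renaming a variable consistently does not change input/output behavior, P’ halts on 0 exactly when P does, and P’ satisfies condition (ii) by construction.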
9.5.2 Halting and Rice’s Theorem for other Turing-complete models


As we saw before, many natural computational models turn out to be
equivalent to one another, in the sense that we can transform a
“program” of one model (such as a λ expression, or a game-of-life
configuration) into another model (such as a NAND-TM program). This
equivalence implies that we can translate the uncomputability of the
Halting problem for NAND-TM programs into uncomputability for
Halting in other models. For example:


Theorem 9.17 — NAND-TM Machine Halting. Let NANDTMHALT :
{0,1}* → {0,1} be the function that on input strings P ∈
{0,1}* and x ∈ {0,1}* outputs 1 if the NAND-TM program
described by P halts on the input x and outputs 0 otherwise. Then
NANDTMHALT is uncomputable.


Proof. We have seen in Theorem 7.11 that for every Turing machine
M, there is an equivalent NAND-TM program P_M such that for
every x, P_M(x) = M(x). In particular this means that HALT(M, x) =
NANDTMHALT(P_M, x).

The transformation M ↦ P_M that is obtained from the proof
of Theorem 7.11 is constructive. That is, the proof yields a way to
compute the map M ↦ P_M. This means that this proof yields a
reduction from the task of computing HALT to the task of computing
NANDTMHALT, which means that since HALT is uncomputable,
so is NANDTMHALT.

The same proof carries over to other computational models such as
the λ calculus, two dimensional (or even one-dimensional) automata etc.
Hence for example, there is no algorithm to decide if a λ expression
evaluates to the identity function, and no algorithm to decide whether
an initial configuration of the game of life will result in eventually
coloring the cell (0, 0) black or not.

Indeed, we can generalize Rice’s Theorem to all these models. For
example, if F : {0,1}* → {0,1} is a non-trivial function such that
F(P) = F(P’) for every pair of functionally equivalent NAND-TM programs
P, P’ then F is uncomputable, and the same holds for NAND-RAM
programs, λ-expressions, and all other Turing complete models (as
defined in Definition 8.5), see also Exercise 9.12.


9.5.3 Is software verification doomed? (discussion) 

Programs are increasingly being used for mission critical purposes,
whether it’s running our banking system, flying planes, or monitoring
nuclear reactors. If we can’t even give an algorithm to certify that
a program correctly computes the parity function, how can we ever
be assured that a program does what it is supposed to do? The key
insight is that while it is impossible to certify that a general program
conforms with a specification, it is possible to write a program in
the first place in a way that will make it easier to certify. As a trivial
example, if you write a program without loops, then you can certify
that it halts. Also, while it might not be possible to certify that an
arbitrary program computes the parity function, it is quite possible to
write a particular program P for which we can mathematically prove
that P computes the parity. In fact, writing programs or algorithms
and providing proofs for their correctness is what we do all the time in
algorithms research.
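
As a small illustration of a program written so that it is easy to reason about, consider the sketch below. The loop runs over a finite sequence, so it clearly halts, and the invariant stated in the comment gives correctness by induction on i. The exhaustive check afterwards is only a sanity test on short inputs and is not a substitute for that argument; the function name `parity` and the test bound are assumptions of this example.

```python
from itertools import product


def parity(bits):
    """Return the XOR of a finite sequence of 0/1 values."""
    acc = 0
    for i, b in enumerate(bits):
        # Invariant: at this point acc == sum(bits[:i]) % 2.
        acc ^= b
    # The invariant at the end of the loop gives acc == sum(bits) % 2.
    return acc


# Sanity check on all inputs of length at most 8.
ok = all(parity(x) == sum(x) % 2
         for n in range(9)
         for x in product((0, 1), repeat=n))
print(ok)  # prints: True
```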

The field of software verification is concerned with verifying that
given programs satisfy certain conditions. These conditions can be
that the program computes a certain function, that it never writes
into a dangerous memory location, that it respects certain
invariants, and others. While the general tasks of verifying this may be
uncomputable, researchers have managed to do so for many
interesting cases, especially if the program is written in the first place in
a formalism or programming language that makes verification
easier. That said, verification, especially of large and complex programs,
remains a highly challenging task in practice as well, and the
number of programs that have been formally proven correct is still quite
small. Moreover, even phrasing the right theorem to prove (i.e., the
specification) is often a highly non-trivial endeavor.
Figure 9.9: The set R of computable Boolean functions
(Definition 7.3) is a proper subset of the set of all
functions mapping {0,1}* to {0,1}. In this chapter
we saw a few examples of elements in the latter set
that are not in the former.

[Diagram: inside the set of all functions F : {0,1}* → {0,1}, the
region R of computable functions, with HALT, HALTONZERO and
COMPUTES-PARITY marked as points outside R.]


9.6 EXERCISES


Exercise 9.1 — NAND-RAM Halt. Let NANDRAMHALT : {0,1}* → {0,1}
be the function such that on input (P, x) where P represents a
NAND-RAM program, NANDRAMHALT(P, x) = 1 iff P halts on the input x.
Prove that NANDRAMHALT is uncomputable.


Exercise 9.2 — Timed halting. Let TIMEDHALT : {0,1}* → {0,1} be
the function that on input (a string representing) a triple (M, x, T),
TIMEDHALT(M, x, T) = 1 iff the Turing machine M, on input x,
halts within at most T steps (where a step is defined as one sequence
of reading a symbol from the tape, updating the state, writing a new
symbol and (potentially) moving the head).

Prove that TIMEDHALT is computable.


Exercise 9.3 — Space halting (challenging). Let SPACEHALT : {0,1}* →
{0,1} be the function that on input (a string representing) a triple
(M, x, T), SPACEHALT(M, x, T) = 1 iff the Turing machine M, on
input x, halts before its head reaches the T-th location of its tape. (We
don’t care how many steps M makes, as long as the head stays inside
locations {0,..., T − 1}.)

Prove that SPACEHALT is computable. See footnote for hint.²

² A machine with alphabet Σ can have at most |Σ|^T
choices for the contents of the first T locations of
its tape. What happens if the machine repeats a
previously seen configuration, in the sense that the
tape contents, the head location, and the current state,
are all identical to what they were in some previous
state of the execution?


Exercise 9.4 — Computable compositions. Suppose that F : {0,1}* → {0,1}
and G : {0,1}* → {0,1} are computable functions. For each one of the
following functions H, either prove that H is necessarily computable or
give an example of a pair F and G of computable functions such that
H will not be computable. Prove your assertions.

1. H(x) = 1 iff F(x) = 1 OR G(x) = 1.


2. H(x) = 1 iff there exist two non-empty strings u, v ∈ {0,1}* such
that x = uv (i.e., x is the concatenation of u and v), F(u) = 1 and
G(v) = 1.


3. H(x) = 1 iff there exists a list u_0, ..., u_{t−1} of non-empty strings such
that F(u_i) = 1 for every i ∈ [t] and x = u_0 u_1 ⋯ u_{t−1}.


4. H(x) = 1 iff x is a valid string representation of a NAND++
program P such that for every z ∈ {0,1}*, on input z the program
P outputs F(z).


5. H(x) = 1 iff x is a valid string representation of a NAND++
program P such that on input x the program P outputs F(x).


6. H(x) = 1 iff x is a valid string representation of a NAND++
program P such that on input x, P outputs F(x) after executing at
most 100 · |x|² lines.


Exercise 9.5 Prove that the following function FINITE : {0,1}* → {0,1}
is uncomputable. On input P ∈ {0,1}*, we define FINITE(P) = 1
if and only if P is a string that represents a NAND++ program such
that there are only a finite number of inputs x ∈ {0,1}* s.t. P(x) = 1.³

³ Hint: You can use Rice’s Theorem.


Exercise 9.6 — Computing parity. Prove Theorem 9.13 without using Rice’s
Theorem.


Exercise 9.7 — TM Equivalence. Let EQ : {0,1}* → {0,1} be the function
defined as follows: given a string representing a pair (M, M′)
of Turing machines, EQ(M, M′) = 1 iff M and M′ are functionally
equivalent as per Definition 9.14. Prove that EQ is uncomputable.

Note that you cannot use Rice’s Theorem directly, as this theorem 
only deals with functions that take a single Turing machine as input, 
and EQ takes two machines. 


Exercise 9.8 For each of the following two functions, say whether it is 
computable or not: 


1. Given a NAND-TM program P, an input x, and a number k, when
we run P on x, does the index variable i ever reach k?


2. Given a NAND-TM program P, an input x, and a number k, when
we run P on x, does P ever write to an array at index k?


Exercise 9.9 Let F : {0,1}* → {0,1} be the function that is defined as
follows. On input a string P that represents a NAND-RAM program
and a string M that represents a Turing machine, F(P, M) = 1 if and
only if there exists some input x such that P halts on x but M does not halt
on x. Prove that F is uncomputable. See footnote for hint.⁴


Exercise 9.10 — Recursively enumerable. Define a function F : {0,1}* →
{0,1} to be recursively enumerable if there exists a Turing machine M
such that for every x ∈ {0,1}*, if F(x) = 1 then M(x) = 1,
and if F(x) = 0 then M(x) = ⊥. (I.e., if F(x) = 0 then M does not halt
on x.)

1. Prove that every computable F is also recursively enumerable.


2. Prove that there exists F that is not computable but is recursively
enumerable. See footnote for hint.⁵


3. Prove that there exists a function F : {0,1}* → {0,1} such that F is
not recursively enumerable. See footnote for hint.⁶


4. Prove that there exists a function F : {0,1}* → {0,1} such that
F is recursively enumerable but the function F̄ defined as F̄(x) =
1 − F(x) is not recursively enumerable. See footnote for hint.⁷


4 Hint: While it cannot be applied directly, with a
little “massaging” you can prove this using Rice’s
Theorem.


5 HALT has this property.


6 You can either use the diagonalization method to
prove this directly or show that the set of all recursively
enumerable functions is countable.


7 HALT has this property: show that if both HALT
and its complement 1 − HALT were recursively enumerable
then HALT would be in fact computable.


Exercise 9.11 — Rice’s Theorem: standard form. In this exercise we will 
prove Rice’s Theorem in the form that it is typically stated in the litera- 
ture. 

For a Turing machine M, define L(M) ⊆ {0,1}* to be the set of all
x ∈ {0,1}* such that M halts on the input x and outputs 1. (The set
L(M) is known in the literature as the language recognized by M. Note
that M might either output a value other than 1 or not halt at all on
inputs x ∉ L(M).)


1. Prove that for every Turing machine M, if we define F_M : {0,1}* →
{0,1} to be the function such that F_M(x) = 1 iff x ∈ L(M) then F_M
is recursively enumerable as defined in Exercise 9.10.


2. Use Theorem 9.15 to prove that for every G : {0,1}* → {0,1}, if (a)
G is neither the constant zero nor the constant one function, and
(b) for every M, M′ such that L(M) = L(M′), G(M) = G(M′),
then G is uncomputable. See footnote for hint.⁸


Exercise 9.12 — Rice’s Theorem for general Turing-equivalent models (optional). 
Let ℱ be the set of all partial functions from {0,1}* to {0,1} and ℳ :
{0,1}* → ℱ be a Turing-equivalent model as defined in Definition 8.5.
We define a function F : {0,1}* → {0,1} to be ℳ-semantic if there
exists some G : ℱ → {0,1} such that F(P) = G(ℳ(P)) for every
P ∈ {0,1}*.

Prove that for every ℳ-semantic F : {0,1}* → {0,1} that is neither
the constant one nor the constant zero function, F is uncomputable.


8 Show that any G satisfying (b) must be semantic.


Exercise 9.13 — Busy Beaver. In this question we define the NAND-TM
variant of the busy beaver function (see Aaronson’s 1999 essay,
2017 blog post and 2020 survey [Aar20]; see also Tao’s highly recommended
presentation on how civilization’s scientific progress can be
measured by the quantities we can grasp).


1. Let T_BB : {0,1}* → ℕ be defined as follows. For every string P ∈
{0,1}*, if P represents a NAND-TM program such that when P is
executed on the input 0 it halts within M steps, then T_BB(P) =
M. Otherwise (if P does not represent a NAND-TM program, or it
is a program that does not halt on 0), T_BB(P) = 0. Prove that T_BB
is uncomputable.


2. Let TOWER(n) denote the number 2^(2^(⋯^2)) with n twos (that is, a
“tower of powers of two” of height n). To get a sense of how fast this
function grows, TOWER(1) = 2, TOWER(2) = 2^2 = 4, TOWER(3) = 2^(2^2) =
16, TOWER(4) = 2^16 = 65536 and TOWER(5) = 2^65536, which
is about 10^20000. TOWER(6) is already a number that is too big to
write even in scientific notation. Define NBB : ℕ → ℕ (for “NAND-TM
Busy Beaver”) to be the function NBB(n) = max_{P ∈ {0,1}^n} T_BB(P)
where T_BB is as defined in part 1 above. Prove that NBB grows
faster than TOWER, in the sense that TOWER(n) = o(NBB(n)). See
footnote for hint.⁹
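To get a feel for the numbers quoted in part 2, the following minimal Python sketch evaluates TOWER for small heights. It only illustrates the growth rate and is not part of a solution; the function name `tower` is an illustrative choice, not notation from the text.

```python
import math

def tower(n):
    """Return a tower of n twos: tower(1) = 2, tower(2) = 2**2, and so on."""
    result = 1
    for _ in range(n):
        result = 2 ** result
    return result

for n in range(1, 5):
    print(n, tower(n))

# tower(5) = 2**65536 can still be computed exactly, but it is too long to
# print comfortably, so we report its number of decimal digits instead.
digits = math.floor(math.log10(tower(5))) + 1
print("TOWER(5) has", digits, "decimal digits")
```

Calling `tower(6)` is not advisable: the result would need about 2^65536 bits to store.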


9.7 BIBLIOGRAPHICAL NOTES 


The cartoon of the Halting problem in Fig. 9.1 is taken from Charles
Cooper’s website, Copyright 2019 Charles F. Cooper.

Section 7.2 in [MM11] gives a highly recommended overview of
uncomputability. Gödel, Escher, Bach [Hof99] is a classic popular
science book that touches on uncomputability and unprovability, and
specifically Gödel’s Theorem that we will see in Chapter 11. See also
the recent book by Holt [Hol18].

The history of the definition of a function is intertwined with the
development of mathematics as a field. For many years, a function
was identified (as per Euler’s quote above) with the means to calculate
the output from the input. In the 1800s, with the invention of
the Fourier series and with the systematic study of continuity and
differentiability, people started looking at more general kinds of
functions, but the modern definition of a function as an arbitrary mapping
was not yet universally accepted. For example, in 1899 Poincaré
wrote “we have seen a mass of bizarre functions which appear to be forced
to resemble as little as possible honest functions which serve some purpose.
... they are invented on purpose to show that our ancestor's reasoning was at
fault, and we shall never get anything more than that out of them”. Some of
this fascinating history is discussed in [Gra83; Kle91; Lüt02; Gra05].

The existence of a universal Turing machine and the uncomputability
of HALT were first shown by Turing in his seminal paper [Tur37],
though closely related results were shown by Church a year before.
These works built on Gödel’s 1931 incompleteness theorem that we will
discuss in Chapter 11.

Some universal Turing machines with a small alphabet and number
of states are given in [Rog96], including a single-tape universal Turing
machine with the binary alphabet and with fewer than 25 states; see
also the survey [WN09]. Adam Yedidia has written software to help
in producing Turing machines with a small number of states. This is
related to the recreational pastime of “Code Golfing”, which is about
solving a certain computational task using as short a program as
possible. Finding “highly complex” small Turing machines is also
related to the “Busy Beaver” problem, see Exercise 9.13 and the survey
[Aar20].


9 You will not need to use very specific properties
of the TOWER function in this exercise. For example,
NBB(n) also grows faster than the Ackermann
function.

The diagonalization argument used to prove uncomputability of F* 
is derived from Cantor’s argument for the uncountability of the reals 
discussed in Chapter 2. 

Christopher Strachey was an English computer scientist and the 
inventor of the CPL programming language. He was also an early 
artificial intelligence visionary, programming a computer to play 
Checkers and even write love letters in the early 1950’s, see this New 
Yorker article and this website. 

Rice’s Theorem was proven in [Ric53]. It is typically stated in a 
form somewhat different than what we used, see Exercise 9.11. 

We do not discuss in the chapter the concept of recursively enumerable
languages, but it is covered briefly in Exercise 9.10. As usual, we
use function, as opposed to language, notation.


10 Restricted computational models


Learning Objectives:

• See that Turing completeness is not always a good thing.

• Another example of an always-halting formalism: context-free
grammars and simply typed λ calculus.

• The pumping lemma for non context-free functions.

• Examples of computable and uncomputable semantic properties of
regular expressions and context-free grammars.


“Happy families are all alike; every unhappy family is unhappy in its own 
way”, Leo Tolstoy (opening of the book “Anna Karenina”). 


We have seen that many models of computation are Turing equivalent,
including Turing machines, NAND-TM/NAND-RAM programs,
standard programming languages such as C/Python/Javascript, as
well as other models such as the λ calculus and even the game of life.
The flip side of this is that for all these models, Rice’s theorem
(Theorem 9.15) holds as well, which means that any non-trivial semantic
property of programs in such a model is uncomputable.

The uncomputability of halting and other semantic specification 
problems for Turing equivalent models motivates restricted com- 
putational models that are (a) powerful enough to capture a set of 
functions useful for certain applications but (b) weak enough that we 
can still solve semantic specification problems on them. In this chapter 
we discuss several such examples. 


10.1 TURING COMPLETENESS AS A BUG 


We have seen that seemingly simple computational models or systems
can turn out to be Turing complete. The following webpage lists
several examples of formalisms that “accidentally” turned out to be
Turing complete, including supposedly limited languages such as the C
preprocessor, CSS, (certain variants of) SQL, sendmail configuration,
as well as games such as Minecraft, Super Mario, and the card game
“Magic: The Gathering”. Turing completeness is not always a good
thing, as it means that such formalisms can give rise to arbitrarily
complex behavior. For example, the PostScript format (a precursor of
PDF) is a Turing-complete programming language meant to describe
documents for printing. The expressive power of PostScript can allow
for short descriptions of very complex images, but it also gave rise to
some nasty surprises, such as the attacks described in this page, ranging
from using infinite loops as a denial of service attack to accessing
the printer’s file system.


Compiled on 12.19.2022 22:58


Example 10.1 — The DAO Hack. An interesting recent example of the
pitfalls of Turing-completeness arose in the context of the cryp- 
tocurrency Ethereum. The distinguishing feature of this currency 
is the ability to design “smart contracts” using an expressive (and 
in particular Turing-complete) programming language. In our 
current “human operated” economy, Alice and Bob might sign a 
contract to agree that if condition X happens then they will jointly 
invest in Charlie’s company. Ethereum allows Alice and Bob to 
create a joint venture where Alice and Bob pool their funds to- 
gether into an account that will be governed by some program P 
that decides under what conditions it disburses funds from it. For 
example, one could imagine a piece of code that interacts between 
Alice, Bob, and some program running on Bob’s car that allows 
Alice to rent out Bob’s car without any human intervention or 
overhead. 

Specifically, Ethereum uses the Turing-complete programming
language Solidity, which has a syntax similar to JavaScript. The
flagship of Ethereum was an experiment known as the “Decentralized
Autonomous Organization” or the DAO. The idea was
to create a smart contract that would create an autonomously run
decentralized venture capital fund, without human managers,
where shareholders could decide on investment opportunities. The
DAO was at the time the biggest crowdfunding success in history.
At its height the DAO was worth 150 million dollars, which was
more than ten percent of the total Ethereum market. Investing in
the DAO (or entering any other “smart contract”) amounts to providing
your funds to be run by a computer program, i.e., “code
is law”, or to use the words with which the DAO described itself: “The DAO
is borne from immutable, unstoppable, and irrefutable computer code”.
Unfortunately, it turns out that (as we saw in Chapter 9) understanding
the behavior of computer programs is quite a hard thing
to do. A hacker (or perhaps, some would say, a savvy investor)
was able to fashion an input that caused the DAO code to enter
into an infinite recursive loop in which it continuously transferred
funds into the hacker’s account, thereby draining about 60 million
dollars from the DAO. While this transaction was “legal” in
the sense that it complied with the code of the smart contract, it
was obviously not what the humans who wrote this code had in
mind. The Ethereum community struggled with the response to
this attack. Some tried the “Robin Hood” approach of using the
same loophole to drain the DAO funds into a secure account, but
it only had limited success. Eventually, the Ethereum community
decided that the code can be mutable, stoppable, and refutable.
Specifically, the Ethereum maintainers and miners agreed on a
“hard fork” (also known as a “bailout”) to revert history to before
the hacker’s transaction occurred. Some community members
strongly opposed this decision, and so an alternative currency
called Ethereum Classic was created that preserved the original
history.


Figure 10.1: Some restricted computational models.
We have already seen two equivalent restricted
models of computation: regular expressions and
deterministic finite automata. We show a more
powerful model: context-free grammars. We also
present tools to demonstrate that some functions
cannot be computed in these models.


10.2 CONTEXT FREE GRAMMARS 


If you have ever written a program, you've experienced a syntax error. 
You probably also had the experience of your program entering into 
an infinite loop. What is less likely is that the compiler or interpreter 
entered an infinite loop while trying to figure out if your program has 
a syntax error. 

When a person designs a programming language, they need to
determine its syntax. That is, the designer decides which strings
correspond to valid programs, and which ones do not (i.e., which strings
contain a syntax error). To ensure that a compiler or interpreter
always halts when checking for syntax errors, language designers
typically do not use a general Turing-complete mechanism to express their
syntax. Rather they use a restricted computational model. One of the
most popular choices for such models is context free grammars.

To explain context free grammars, let us begin with a canonical
example. Consider the function ARITH : Σ* → {0,1} that takes as input
a string x over the alphabet Σ = {(, ), +, −, ×, ÷, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
and returns 1 if and only if the string x represents a valid arithmetic
expression. Intuitively, we build expressions by applying an operation
such as +, −, × or ÷ to smaller expressions, or enclosing them in
parentheses, where the “base case” corresponds to expressions that
are simply numbers. More precisely, we can make the following
definitions:


• A digit is one of the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.


• A number is a sequence of digits. (For simplicity we drop the condition
that the sequence does not have a leading zero, though it is not
hard to encode it in a context-free grammar as well.)


• An operation is one of +, −, ×, ÷.


• An expression has either the form “number”, the form “sub-expression1
operation sub-expression2”, or the form “(sub-expression1)”,
where “sub-expression1” and “sub-expression2” are
themselves expressions. (Note that this is a recursive definition.)


A context free grammar (CFG) is a formal way of specifying such
conditions. A CFG consists of a set of rules that tell us how to generate
strings from smaller components. In the above example, one of the
rules is “if exp1 and exp2 are valid expressions, then exp1 × exp2 is
also a valid expression”; we can also write this rule using the shorthand
expression ⇒ expression × expression. As in the above example,
the rules of a context-free grammar are often recursive: the rule
expression ⇒ expression × expression defines valid expressions in
terms of itself. We now formally define context-free grammars:


Definition 10.2 — Context Free Grammar. Let Σ be some finite set. A
context free grammar (CFG) over Σ is a triple (V, R, s) such that:


• V, known as the variables, is a set disjoint from Σ.

• s ∈ V is known as the initial variable.

• R is a set of rules. Each rule is a pair (v, z) with v ∈ V and
z ∈ (Σ ∪ V)*. We often write the rule (v, z) as v ⇒ z and say that
the string z can be derived from the variable v.


Example 10.3 — Context free grammar for arithmetic expressions. The
example above of well-formed arithmetic expressions can be captured
formally by the following context free grammar:


• The alphabet Σ is {(, ), +, −, ×, ÷, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

• The variables are V = {expression, number, digit, operation}.

• The rules are the set R containing the following 19 rules:

  - The 4 rules operation ⇒ +, operation ⇒ −, operation ⇒ ×,
    and operation ⇒ ÷.

  - The 10 rules digit ⇒ 0, ..., digit ⇒ 9.

  - The rule number ⇒ digit.

  - The rule number ⇒ digit number.

  - The rule expression ⇒ number.

  - The rule expression ⇒ expression operation expression.

  - The rule expression ⇒ (expression).

• The starting variable is expression.
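The triple (V, R, s) of Example 10.3 can be written down directly as data. The following is a minimal Python sketch under some assumptions of ours: symbols are Python strings, a rule (v, z) is a pair whose second component is a tuple of symbols, the ASCII characters `*` and `/` stand in for × and ÷, and the helper name `is_valid_cfg` is purely illustrative.

```python
# Alphabet of Example 10.3, with ASCII stand-ins * and / for the symbols × and ÷.
SIGMA = set("()+-*/0123456789")
V = {"expression", "number", "digit", "operation"}
S = "expression"

# Each rule is a pair (v, z): v is a variable and z is a tuple over SIGMA ∪ V.
R = (
    [("operation", (op,)) for op in "+-*/"]
    + [("digit", (d,)) for d in "0123456789"]
    + [
        ("number", ("digit",)),
        ("number", ("digit", "number")),
        ("expression", ("number",)),
        ("expression", ("expression", "operation", "expression")),
        ("expression", ("(", "expression", ")")),
    ]
)

def is_valid_cfg(sigma, variables, rules, start):
    """Check the conditions of Definition 10.2 for a candidate grammar."""
    if sigma & variables:        # V must be disjoint from the alphabet
        return False
    if start not in variables:   # the initial variable must belong to V
        return False
    # every rule (v, z) needs v in V and z a string over SIGMA ∪ V
    return all(
        v in variables and all(sym in sigma or sym in variables for sym in z)
        for v, z in rules
    )

print(len(R), is_valid_cfg(SIGMA, V, R, S))
```

Representing the right-hand side as a tuple rather than a string keeps multi-character variable names such as `number` distinct from alphabet symbols.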


People use many different notations to write context free grammars.
One of the most common notations is the Backus-Naur form. In this
notation we write a rule of the form v ⇒ a (where v is a variable and a
is a string) in the form <v> := a. If we have several rules of the form
v ⇒ a, v ⇒ b, and v ⇒ c then we can combine them as <v> := a|b|c.
(In words we say that v can derive either a, b, or c.) For example, the
Backus-Naur description for the context free grammar of Example 10.3
is the following (using ASCII equivalents for operations):


operation := +|-|*|/
digit := 0|1|2|3|4|5|6|7|8|9
number := digit|digit number
expression := number|expression operation expression|(expression)


Another example of a context free grammar is the “matching parentheses”
grammar, which can be represented in Backus-Naur form as follows:


match := ""|match match|(match)


A string over the alphabet {(, )} can be generated from this grammar
(where match is the starting expression and "" corresponds to the
empty string) if and only if it consists of a matching set of parentheses.
In contrast, by Lemma 6.20 there is no regular expression that matches
a string x if and only if x contains a valid sequence of matching
parentheses.


10.2.1 Context-free grammars as a computational model 

We can think of a context-free grammar over the alphabet Σ as defining
a function that maps every string x in Σ* to 1 or 0 depending on
whether x can be generated by the rules of the grammar. We now
make this definition formal.


Definition 10.4 — Deriving a string from a grammar. If G = (V, R, s) is a
context-free grammar over Σ, then for two strings α, β ∈ (Σ ∪ V)*
we say that β can be derived in one step from α, denoted by α ⇒_G β,
if we can obtain β from α by applying one of the rules of G. That is,
we obtain β by replacing in α one occurrence of the variable v with
the string z, where v ⇒ z is a rule of G.

We say that β can be derived from α, denoted by α ⇒*_G β, if it
can be derived by some finite number k of steps. That is, if there
are α_1, ..., α_{k−1} ∈ (Σ ∪ V)*, so that α ⇒_G α_1 ⇒_G α_2 ⇒_G ⋯ ⇒_G
α_{k−1} ⇒_G β.

We say that x ∈ Σ* is matched by G = (V, R, s) if x can be derived
from the starting variable s (i.e., if s ⇒*_G x). We define the
function computed by (V, R, s) to be the map Φ_{V,R,s} : Σ* → {0,1}
such that Φ_{V,R,s}(x) = 1 iff x is matched by (V, R, s). A function
F : Σ* → {0,1} is context free if F = Φ_{V,R,s} for some CFG (V, R, s).¹
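The one-step derivation relation of Definition 10.4 can be sketched in a few lines of Python. This is an illustration under assumptions of ours: sentential forms are tuples of symbols, rules are pairs (v, z) as in Definition 10.2, and the names `one_step` and `matched_within` are hypothetical. Because the search is cut off after a fixed number of steps, a `False` answer only means that no derivation of that length was found.

```python
def one_step(alpha, rules, variables):
    """Return every β with α ⇒_G β, i.e. all results of applying one rule once."""
    results = set()
    for i, sym in enumerate(alpha):
        if sym in variables:
            for v, z in rules:
                if v == sym:
                    # replace this single occurrence of v by the string z
                    results.add(alpha[:i] + z + alpha[i + 1:])
    return results

def matched_within(x, rules, variables, start, max_steps):
    """Bounded search: can x be derived from start in at most max_steps steps?"""
    target = tuple(x)
    frontier = {(start,)}
    for _ in range(max_steps):
        if target in frontier:
            return True
        frontier = {b for a in frontier for b in one_step(a, rules, variables)}
    return target in frontier

# The matching-parentheses grammar: match := "" | match match | (match)
MATCH_RULES = [
    ("match", ()),
    ("match", ("match", "match")),
    ("match", ("(", "match", ")")),
]

print(matched_within("(())", MATCH_RULES, {"match"}, "match", 3))
```

The number of sentential forms grows quickly with the number of steps, so this brute-force search is only practical for very short derivations.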


A priori it might not be clear that the map Φ_{V,R,s} is computable,
but it turns out that this is the case.


Theorem 10.5 — Context-free grammars always halt. For every CFG
(V, R, s) over {0,1}, the function Φ_{V,R,s} : {0,1}* → {0,1} is
computable.


As usual we restrict attention to grammars over {0,1} although the
proof extends to any finite alphabet Σ.


1 As in the case of Definition 6.7 we can also use
language rather than function notation and say that a
language L ⊆ Σ* is context free if the function F such
that F(x) = 1 iff x ∈ L is context free.


Proof. We only sketch the proof. We start with the observation that we can
convert every CFG to an equivalent version in Chomsky normal form,
where all rules either have the form u ⇒ vw for variables u, v, w or the
form u ⇒ σ for a variable u and symbol σ ∈ Σ, plus potentially the
rule s ⇒ "" where s is the starting variable.

The idea behind such a transformation is to simply add new variables
as needed, and so for example we can translate a rule such as
v ⇒ uσw into the three rules v ⇒ ur, r ⇒ tw and t ⇒ σ.

Using the Chomsky normal form we get a natural recursive algorithm
for computing whether s ⇒*_G x for a given grammar G and
string x. We simply try all possible guesses for the first rule s ⇒ uv
that is used in such a derivation, and then all possible ways to partition
x as a concatenation x = x′x″ of two non-empty strings. If we guessed
the rule and the partition correctly, then this reduces our task to checking
whether u ⇒*_G x′ and v ⇒*_G x″, which (as it involves shorter strings) can
be done recursively. The base cases are when x is empty or a single
symbol, and can be easily handled.

∎
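The recursive algorithm from the proof sketch can be written as a short Python function. This is a minimal sketch, assuming the grammar is already given in Chomsky normal form as two collections of our own design: `unit_rules` holds pairs (u, σ) for rules u ⇒ σ, and `pair_rules` holds triples (u, v, w) for rules u ⇒ vw. The name `make_matcher` and the flag `start_derives_empty` (standing for the optional rule s ⇒ "") are illustrative. Memoizing the recursion is an addition of ours that keeps the running time polynomial in |x|.

```python
from functools import lru_cache

def make_matcher(unit_rules, pair_rules, start, start_derives_empty=False):
    """Return a function x -> bool deciding whether start ⇒* x (grammar in CNF)."""
    unit_rules = frozenset(unit_rules)
    pair_rules = tuple(pair_rules)

    @lru_cache(maxsize=None)
    def derives(u, x):
        # Base case: a single symbol must come from a rule u ⇒ σ.
        if len(x) == 1:
            return (u, x) in unit_rules
        # Guess the first rule u ⇒ v w and a split x = x' x'' into non-empty parts.
        return any(
            derives(v, x[:i]) and derives(w, x[i:])
            for (head, v, w) in pair_rules
            if head == u
            for i in range(1, len(x))
        )

    def matcher(x):
        if x == "":
            return start_derives_empty
        return derives(start, x)

    return matcher

# A CNF grammar for the strings 0^n 1^n with n >= 1:
#   S ⇒ A T | A B,   T ⇒ S B,   A ⇒ 0,   B ⇒ 1
matches = make_matcher(
    unit_rules=[("A", "0"), ("B", "1")],
    pair_rules=[("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")],
    start="S",
)
print(matches("000111"), matches("0101"))
```

Every recursive call is on a strictly shorter string, which is why the procedure always halts.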


10.2.2 The power of context free grammars 
Context free grammars can capture every regular expression: 


Theorem 10.7 — Context free grammars and regular expressions. Let e be a
regular expression over {0,1}, then there is a CFG (V, R, s) over
{0,1} such that Φ_{V,R,s} = Φ_e.


Proof. We prove the theorem by induction on the length of e. If e is
an expression of length one, then e = 0 or e = 1, in which case
we leave it to the reader to verify that there is a (trivial) CFG that
computes it. Otherwise, we fall into one of the following cases: case 1:
e = e′e″, case 2: e = e′|e″ or case 3: e = (e′)*, where in all cases e′, e″
are shorter regular expressions. By the induction hypothesis, we can
define grammars (V′, R′, s′) and (V″, R″, s″) that compute Φ_{e′} and
Φ_{e″} respectively. By renaming variables, we can also assume without
loss of generality that V′ and V″ are disjoint.

In case 1, we can define the new grammar as follows: we add a new
starting variable s ∉ V′ ∪ V″ and the rule s ⇒ s′s″. In case 2, we can
define the new grammar as follows: we add a new starting variable
s ∉ V′ ∪ V″ and the rules s ⇒ s′ and s ⇒ s″. Case 3 will be the
only one that uses recursion. As before we add a new starting variable
s ∉ V′ ∪ V″, but now add the rules s ⇒ "" (i.e., the empty string) and
also add, for every rule of the form (s′, α) ∈ R′, the rule s ⇒ sα to R.

We leave it to the reader as (a very good!) exercise to verify that in
all three cases the grammars we produce capture the same function as
the original expression.

∎
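The three cases of the proof translate into a short recursive procedure. The Python sketch below is an illustration under assumptions of ours: a regular expression is given as a nested tuple `("sym", c)`, `("cat", e1, e2)`, `("or", e1, e2)` or `("star", e1)`, fresh variables are named `S0`, `S1`, ..., and rules are pairs (v, z) with z a tuple of symbols. The helper `language_up_to`, which computes all strings of length at most n that a variable derives, is not part of the proof; it is included only so the construction can be checked on small inputs.

```python
import itertools

_fresh = itertools.count()

def regex_to_cfg(e):
    """Return (rules, start) for the regular expression e, following the proof."""
    s = f"S{next(_fresh)}"                  # new starting variable
    kind = e[0]
    if kind == "sym":                       # base case: e = 0 or e = 1
        return [(s, (e[1],))], s
    if kind == "star":                      # case 3: s ⇒ "" and s ⇒ s α
        rules1, s1 = regex_to_cfg(e[1])
        new = [(s, ())] + [(s, (s,) + rhs) for (v, rhs) in rules1 if v == s1]
        return rules1 + new, s
    rules1, s1 = regex_to_cfg(e[1])
    rules2, s2 = regex_to_cfg(e[2])
    if kind == "cat":                       # case 1: s ⇒ s' s''
        new = [(s, (s1, s2))]
    else:                                   # case 2: s ⇒ s' and s ⇒ s''
        new = [(s, (s1,)), (s, (s2,))]
    return rules1 + rules2 + new, s

def language_up_to(rules, start, n):
    """All strings of length <= n derivable from start (least fixed point)."""
    variables = {v for v, _ in rules}
    lang = {v: set() for v in variables}
    changed = True
    while changed:
        changed = False
        for v, rhs in rules:
            partial = {""}
            for sym in rhs:
                options = lang[sym] if sym in variables else {sym}
                partial = {a + b for a in partial for b in options
                           if len(a + b) <= n}
            if not partial <= lang[v]:
                lang[v] |= partial
                changed = True
    return lang[start]

# (0|1)*1 : all binary strings that end with 1
expr = ("cat", ("star", ("or", ("sym", "0"), ("sym", "1"))), ("sym", "1"))
rules, start = regex_to_cfg(expr)
print(sorted(language_up_to(rules, start, 3), key=lambda w: (len(w), w)))
```

Drawing variable names from a single global counter plays the role of the "renaming variables" step, since it guarantees that the variable sets of the two sub-grammars are disjoint.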


It turns out that CFG’s are strictly more powerful than regular 
expressions. In particular, as we’ve seen, the “matching parentheses” 
function MATCHPAREN can be computed by a context free grammar, 
whereas, as shown in Lemma 6.20, it cannot be computed by regular 
expressions. Here is another example: 


Solved Exercise 10.1 — Context free grammar for palindromes. Let PAL :
{0,1,;}* → {0,1} be the function defined in Solved Exercise 6.4 where
PAL(w) = 1 iff w has the form u;u^R. Then PAL can be computed by a
context-free grammar.


Solution:
A simple grammar computing PAL can be described using
Backus-Naur notation:


start := ; | 0 start 0 | 1 start 1


One can prove by induction that this grammar generates exactly
the strings w such that PAL(w) = 1.
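As a sanity check of the claim above (not a substitute for the induction proof), the following Python sketch compares a recognizer that mirrors the grammar rule by rule with the direct definition of PAL on all short strings. The function names are illustrative choices of ours.

```python
from itertools import product

def generated_by_start(w):
    """Mirror of the grammar: start := ; | 0 start 0 | 1 start 1."""
    if w == ";":
        return True
    if len(w) >= 3 and w[0] == w[-1] and w[0] in "01":
        return generated_by_start(w[1:-1])
    return False

def pal(w):
    """Direct definition: w = u;u^R for some u in {0,1}*."""
    u, sep, v = w.partition(";")
    return sep == ";" and set(u) <= set("01") and v == u[::-1]

def agree_up_to(n):
    """Check that the two functions agree on every string of length <= n."""
    return all(
        generated_by_start(w) == pal(w)
        for k in range(n + 1)
        for w in map("".join, product("01;", repeat=k))
    )

print(agree_up_to(6))
```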


A more interesting example is computing the strings of the form
u;v that are not palindromes:


Solved Exercise 10.2 — Non-palindromes. Prove that there is a context free
grammar that computes NPAL : {0,1,;}* → {0,1} where NPAL(w) =
1 iff w = u;v but v ≠ u^R.


Solution:
Using Backus-Naur notation we can describe such a grammar as
follows:


palindrome := ; | 0 palindrome 0 | 1 palindrome 1
different := 0 palindrome 1 | 1 palindrome 0
start := different | 0 start | 1 start | start 0 | start 1


In words, this means that we can characterize a string w such
that NPAL(w) = 1 as having the following form

w = α b u ; u^R b′ β

where α, β, u are arbitrary strings and b ≠ b′. Hence we can
generate such a string by first generating a palindrome u;u^R
(palindrome variable), then adding 0 on either the left or right and
1 on the opposite side to get something that is not a palindrome
(different variable), and then we can add an arbitrary number of 0’s
and 1’s on either end (the start variable). (Strictly speaking, this form
captures the case where u^R and v disagree in some position; the case
where one of them is a proper prefix of the other can be handled by
adding similar rules.)


10.2.3 Limitations of context-free grammars (optional) 

Even though context-free grammars are more powerful than regular 
expressions, there are some simple languages that are not captured 
by context free grammars. One tool to show this is the context-free 
grammar analog of the “pumping lemma” (Theorem 6.21): 


Theorem 10.8 — Context-free pumping lemma. Let (V, R, s) be a CFG
over Σ, then there are some numbers n_0, n_1 ∈ ℕ such that for every
x ∈ Σ* with |x| > n_0, if Φ_{V,R,s}(x) = 1 then x = abcde such that
|b| + |c| + |d| ≤ n_1, |b| + |d| ≥ 1, and Φ_{V,R,s}(a b^k c d^k e) = 1 for every
k ∈ ℕ.


Proof of Theorem 10.8. We only sketch the proof. The idea is that if
the total number of symbols in the rules of the grammar is n_0, then
the only way to get |x| > n_0 with Φ_{V,R,s}(x) = 1 is to use recursion.
That is, there must be some variable v ∈ V such that we are able to
derive from v the value bvd for some strings b, d ∈ Σ*, and then further
on derive from v some string c ∈ Σ* such that bcd is a substring of
x (in other words, x = abcde for some a, e ∈ Σ*). If we take
the variable v satisfying this requirement with a minimum number
of derivation steps, then we can ensure that |bcd| is at most some
constant depending on n_0 and we can set n_1 to be that constant (n_1 =
10 · |R| · n_0 will do, since we will not need more than |R| applications
of rules, and each such application can grow the string by at most n_0
symbols).

Thus by the definition of the grammar, we can repeat the derivation
to replace the substring bcd in x with b^k c d^k for every k ∈ ℕ while
retaining the property that the output of Φ_{V,R,s} is still one. Since bcd
is a substring of x, we can write x = abcde and are guaranteed that
a b^k c d^k e is matched by the grammar for every k.


Using Theorem 10.8 one can show that even the simple function
F : {0,1}* → {0,1} defined as follows:


F(x) = 1 if x = ww for some w ∈ {0,1}*, and F(x) = 0 otherwise


is not context free. (In contrast, the function G : {0,1}* → {0,1}
defined as G(x) = 1 iff x = w w^R = w_0 w_1 ⋯ w_{n−1} w_{n−1} w_{n−2} ⋯ w_0 for some
w ∈ {0,1}* and n = |w| is context free, can you see why?)


Solved Exercise 10.3 — Equality is not context-free. Let EQ : {0,1,;}* →
{0,1} be the function such that EQ(x) = 1 if and only if x = u;u for
some u ∈ {0,1}*. Then EQ is not context free.


Solution:

We use the context-free pumping lemma. Suppose, towards a
contradiction, that there is a grammar G that computes EQ,
and let n be the larger of the two constants n_0, n_1 obtained from
Theorem 10.8.

Consider the string x = 1^n 0^n ; 1^n 0^n, and write it as x = abcde
as per Theorem 10.8, with |bcd| ≤ n and with |b| + |d| ≥ 1. By
Theorem 10.8 (taking k = 0), it should hold that EQ(ace) = 1. However,
by case analysis this can be shown to be a contradiction.

Firstly, unless b is on the left side of the ; separator and d is on
the right side, dropping b and d will definitely make the two parts
different. But if it is the case that b is on the left side and d is on the
right side, then by the condition that |bcd| ≤ n we know that b is a
string of only zeros and d is a string of only ones. If we drop b and
d then since one of them is non-empty, we get that there are either
fewer zeroes on the left side than on the right side, or there are fewer
ones on the right side than on the left side. In either case, we get
that EQ(ace) = 0, obtaining the desired contradiction.
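The case analysis above can be checked by brute force for small values of the constant. The Python sketch below enumerates every way to write x = 1^n 0^n ; 1^n 0^n as abcde with |bcd| ≤ n and |b| + |d| ≥ 1, and collects the splits for which ace is still of the form u;u. The function names are illustrative, and the check covers only the small n that are tried, so it complements the argument rather than replacing it.

```python
def eq(x):
    """EQ(x) = 1 iff x = u;u for some u in {0,1}*."""
    u, sep, v = x.partition(";")
    return sep == ";" and u == v and set(u) <= set("01")

def surviving_splits(n):
    """Splits x = abcde (|bcd| <= n, |b|+|d| >= 1) for which EQ(ace) = 1."""
    x = "1" * n + "0" * n + ";" + "1" * n + "0" * n
    length = len(x)
    found = []
    for i in range(length + 1):                 # a = x[:i]
        top = min(length, i + n)                # enforce |bcd| <= n
        for j in range(i, top + 1):             # b = x[i:j]
            for k in range(j, top + 1):         # c = x[j:k]
                for l in range(k, top + 1):     # d = x[k:l]
                    if (j - i) + (l - k) >= 1 and eq(x[:i] + x[j:k] + x[l:]):
                        found.append((x[:i], x[i:j], x[j:k], x[k:l], x[l:]))
    return found

for n in range(1, 6):
    print(n, surviving_splits(n))
```

An empty list for each n means that no admissible split survives pumping down to k = 0, which is exactly the contradiction used in the solution.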


10.3 SEMANTIC PROPERTIES OF CONTEXT FREE LANGUAGES 


As in the case of regular expressions, the limitations of context free 
grammars do provide some advantages. For example, emptiness of 
context free grammars is decidable: 


Theorem 10.9 — Emptiness for CFG’s is decidable. There is an algorithm
that on input a context-free grammar G, outputs 1 if and only if Φ_G
is the constant zero function.


Proof Idea: 

The proof is easier to see if we transform the grammar to Chomsky
Normal Form as in Theorem 10.5. Given a grammar G, we can recursively
define a non-terminal variable v to be non-empty if there is either
a rule of the form v ⇒ σ, or there is a rule of the form v ⇒ uw where
both u and w are non-empty. Then the grammar is non-empty if and
only if the starting variable s is non-empty.

⋆
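The recursive definition of "non-empty" variables amounts to a simple marking loop. The following Python sketch assumes a grammar in Chomsky normal form represented, as an illustration of ours, by `unit_rules` (pairs (v, σ) for rules v ⇒ σ) and `pair_rules` (triples (v, u, w) for rules v ⇒ uw). The flag `start_derives_empty` stands for the optional rule s ⇒ "", and the function name `is_empty` is hypothetical.

```python
def is_empty(unit_rules, pair_rules, start, start_derives_empty=False):
    """Return True iff the CNF grammar generates no string at all."""
    if start_derives_empty:
        return False
    # Step 1: every variable with a rule v ⇒ σ is non-empty.
    nonempty = {v for v, _ in unit_rules}
    # Step 2: keep marking v whenever v ⇒ u w with u and w already marked.
    changed = True
    while changed:
        changed = False
        for v, u, w in pair_rules:
            if v not in nonempty and u in nonempty and w in nonempty:
                nonempty.add(v)
                changed = True
    return start not in nonempty

# S ⇒ A T | A B, T ⇒ S B, A ⇒ 0, B ⇒ 1 generates 0^n 1^n (n >= 1): not empty.
print(is_empty([("A", "0"), ("B", "1")],
               [("S", "A", "T"), ("S", "A", "B"), ("T", "S", "B")], "S"))

# S ⇒ S A, A ⇒ 0: every derivation from S keeps a copy of S, so it is empty.
print(is_empty([("A", "0")], [("S", "S", "A")], "S"))
```

Each pass of the loop either marks a new variable or stops, so the procedure halts after at most |V| passes.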


Proof of Theorem 10.9. We assume that the grammar G is in Chomsky
Normal Form as in Theorem 10.5. We consider the following procedure
for marking variables as “non-empty”:


1. We start by marking all variables v that are involved in a rule of the
form v ⇒ σ as non-empty.


2. We then continue to mark v as non-empty if it is involved in a rule
of the form v ⇒ uw where u, w have been marked before.


We continue this way until we cannot mark any more variables. We
then declare that the grammar is empty if and only if s has not been
marked. To see why this is a valid algorithm, note that if a variable v
has been marked as “non-empty” then there is some string α ∈ Σ* that
can be derived from v. On the other hand, if v has not been marked,
then every sequence of derivations from v will always have a variable
that has not been replaced by alphabet symbols. Hence in particular
Φ_G is the all zero function if and only if the starting variable s is not
marked “non-empty”.
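
The marking procedure from the proof can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the grammar is given as a list of pairs (v, rhs), where rhs is either a one-character string (a rule v ⇒ σ) or a pair of variable names (a rule v ⇒ uw), and the function name is_empty is made up for this example.

```python
def is_empty(rules, start):
    """Return True iff no string can be derived from `start`.

    Assumed (hypothetical) representation of a grammar in Chomsky Normal Form:
    each rule is (v, rhs), with rhs a terminal string such as "0" (v => sigma)
    or a tuple (u, w) of variable names (v => uw).
    """
    # Step 1: mark every variable that has a rule of the form v => sigma.
    marked = {v for v, rhs in rules if isinstance(rhs, str)}
    # Step 2: keep marking v whenever some rule v => uw has u and w marked.
    changed = True
    while changed:
        changed = False
        for v, rhs in rules:
            if (v not in marked and isinstance(rhs, tuple)
                    and rhs[0] in marked and rhs[1] in marked):
                marked.add(v)
                changed = True
    # The grammar is empty iff the starting variable was never marked.
    return start not in marked


nonempty_grammar = [("s", ("a", "b")), ("a", "0"), ("b", "1")]
empty_grammar = [("s", ("s", "a")), ("a", "0")]
print(is_empty(nonempty_grammar, "s"), is_empty(empty_grammar, "s"))
```

Each pass of the loop either marks a new variable or stops, so the number of passes is at most the number of variables plus one.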


10.3.1 Uncomputability of context-free grammar equivalence (optional)

By analogy to regular expressions, one might have hoped to get an 
algorithm for deciding whether two given context free grammars 
are equivalent. Alas, no such luck. It turns out that the equivalence 
problem for context free grammars is uncomputable. This is a direct 
corollary of the following theorem: 


Theorem 10.10 — Fullness of CFG’s is uncomputable. For every set Σ, let
CFGFULL_Σ be the function that on input a context-free grammar G
over Σ, outputs 1 if and only if G computes the constant 1 function.
Then there is some finite Σ such that CFGFULL_Σ is uncomputable.


Theorem 10.10 immediately implies that equivalence for context-free
grammars is uncomputable, since computing “fullness” of a grammar G
over some alphabet Σ = {σ_0, ..., σ_{k−1}} corresponds to checking
whether G is equivalent to the grammar

s ⇒ "" | sσ_0 | ⋯ | sσ_{k−1}.

Note that Theorem 10.10 and Theorem 10.9 together imply that
context-free grammars, unlike regular expressions, are not closed
under complement. (Can you see why?) Since we can encode every
element of Σ using ⌈log |Σ|⌉ bits (and this finite encoding can be easily
carried out within a grammar) Theorem 10.10 implies that fullness is
also uncomputable for grammars over the binary alphabet.


Proof Idea: 

We prove the theorem by reducing from the Halting problem. To
do that we use the notion of configurations of NAND-TM programs, as
defined in Definition 8.8. Recall that a configuration of a program P is a
binary string s that encodes all the information about the program in
the current iteration.

We define Σ to be {0,1} plus some separator characters and define
INVALID_P : Σ* → {0,1} to be the function that maps every string
L ∈ Σ* to 1 if and only if L does not encode a sequence of configurations
that correspond to a valid halting history of the computation of P on
the empty input.

The heart of the proof is to show that INVALID_P is context-free.
Once we do that, we see that P does not halt on the empty input if and
only if INVALID_P(L) = 1 for every L, that is, if and only if the grammar
for INVALID_P is full. To show that INVALID_P is context-free, we will
encode the list in a special way that makes it amenable to deciding via
a context-free grammar. Specifically we will reverse all the
odd-numbered strings.

*


Proof of Theorem 10.10. We only sketch the proof. We will show that if
we can compute CFGFULL then we can solve HALTONZERO, which
has been proven uncomputable in Theorem 9.9. Let M be an input
Turing machine for HALTONZERO. We will use the notion of configurations
of a Turing machine, as defined in Definition 8.8.

Recall that a configuration of a Turing machine M on input x captures
the full state of M at some point of the computation. The particular
details of configurations are not so important, but what you need
to remember is that:


• A configuration can be encoded by a binary string σ ∈ {0,1}*.

• The initial configuration of M on the input 0 is some fixed string.

• A halting configuration will have the value of a certain state (which
can be easily “read off” from it) set to 1.

• If σ is a configuration at some step i of the computation, we denote
by NEXT_M(σ) the configuration at the next step. NEXT_M(σ) is
a string that agrees with σ on all but a constant number of coordinates
(those encoding the position corresponding to the head
position and the two adjacent ones). On those coordinates, the
value of NEXT_M(σ) can be computed by some finite function.


We will let the alphabet Σ = {0,1} ∪ {‖, #}. A computation history
of M on the input 0 is a string L ∈ Σ* that corresponds to a list
‖σ_0#σ_1‖σ_2#σ_3 ⋯ σ_{t−2}‖σ_{t−1}# (i.e. ‖ comes before an even numbered
block, and # comes before an odd numbered one) such that if i is
even then σ_i is the string encoding the configuration of M on input 0
at the beginning of its i-th iteration, and if i is odd then it is the same
except the string is reversed. (That is, for odd i, rev(σ_i) encodes the
configuration of M on input 0 at the beginning of its i-th iteration.)
Reversing the odd-numbered blocks is a technical trick to ensure that
the function INVALID_M we define below is context free.
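
To make the encoding concrete, here is a small Python sketch of how a list of configurations could be packed into such a string and unpacked again. It is an illustration only: the ASCII character "|" stands in for ‖, the function names are invented for this example, and the choice of closing separator (the one that would precede the next block) is an assumption modeled on the list displayed above.

```python
def encode_history(configs):
    """Pack configurations (binary strings) into one string over {0,1,|,#}."""
    parts = []
    for i, sigma in enumerate(configs):
        if i % 2 == 0:
            parts.append("|" + sigma)        # even-numbered block, written as is
        else:
            parts.append("#" + sigma[::-1])  # odd-numbered block, written reversed
    # Assumed convention: close with the separator the next block would use.
    parts.append("|" if len(configs) % 2 == 0 else "#")
    return "".join(parts)


def decode_history(L):
    """Inverse of encode_history for well-formed inputs."""
    blocks = L.replace("#", "|").split("|")[1:-1]
    return [b if i % 2 == 0 else b[::-1] for i, b in enumerate(blocks)]


history = ["01", "10", "11"]
print(encode_history(history))
```

Because every second block is reversed, the j-th symbol of one block faces the j-th symbol from the end of its neighbour across the separator. This mirrored, palindrome-like layout is the kind of pattern a context-free grammar can generate from the middle outwards.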

We now define INVALID_M : Σ* → {0,1} as follows:


INVALID_M(L) = 0   if L is a valid computation history of M on 0
               1   otherwise


We will show the following claim:

CLAIM: INVALID_M is context-free.

The claim implies the theorem. Since M halts on 0 if and only if
there exists a valid computation history, INVALID_M is the constant
one function if and only if M does not halt on 0. In particular, this
allows us to reduce determining whether M halts on 0 to determining
whether the grammar G_M corresponding to INVALID_M is full.

We now turn to the proof of the claim. We will not show all the
details, but the main point is that INVALID_M(L) = 1 if and only if at
least one of the following three conditions holds:


1. L is not of the right format, i.e. not of the form
(binary-string) # (binary-string) ‖ (binary-string) # ⋯.


2. L contains a substring of the form ‖σ#σ′‖ such that
σ′ ≠ rev(NEXT_M(σ)).


3. L contains a substring of the form #σ‖σ′# such that
σ′ ≠ NEXT_M(rev(σ)).


Since context-free functions are closed under the OR operation, the 
claim will follow if we show that we can verify conditions 1, 2 and 3 
via a context-free grammar. 

For condition 1 this is very simple: checking that L is of the correct 
format can be done using a regular expression. Since regular expres- 
sions are closed under negation, this means that checking that L is not 
of this format can also be done by a regular expression and hence by a 
context-free grammar. 

For conditions 2 and 3, this follows via very similar reasoning to
that showing that the function F such that F(u#v) = 1 iff u ≠ rev(v)
is context-free, see Solved Exercise 10.2. After all, the NEXT_M function
only modifies its input in a constant number of places. We leave filling
out the details as an exercise to the reader. Since INVALID_M(L) = 1
if and only if L satisfies one of the conditions 1., 2. or 3., and all three
conditions can be tested for via a context-free grammar, this completes
the proof of the claim and hence the theorem.

∎


10.4 SUMMARY OF SEMANTIC PROPERTIES FOR REGULAR EXPRESSIONS
AND CONTEXT-FREE GRAMMARS


To summarize, we can often trade expressiveness of the model for 
amenability to analysis. If we consider computational models that are 
not Turing complete, then we are sometimes able to bypass Rice’s The- 
orem and answer certain semantic questions about programs in such 
models. Here is a summary of some of what is known about semantic 
questions for the different models we have seen. 


Table 10.1: Computability of semantic properties


Model                    Halting        Emptiness      Equivalence

Regular expressions      Computable     Computable     Computable
Context free grammars    Computable     Computable     Uncomputable
Turing-complete models   Uncomputable   Uncomputable   Uncomputable


10.5 EXERCISES 


Exercise 10.1 — Closure properties of context-free functions. Suppose that
F, G : {0,1}* → {0,1} are context free. For each one of the following
definitions of the function H, either prove that H is always context
free or give a counterexample for regular F, G that would make H not
context free.


1. H(x) = F(x) ∨ G(x).

2. H(x) = F(x) ∧ G(x).

3. H(x) = NAND(F(x), G(x)).

4. H(x) = F(x^R) where x^R is the reverse of x: x^R = x_{n−1}x_{n−2} ⋯ x_0 for
n = |x|.

5. H(x) = 1 if x = uv s.t. F(u) = G(v) = 1, and H(x) = 0 otherwise.

6. H(x) = 1 if x = uu s.t. F(u) = G(u) = 1, and H(x) = 0 otherwise.

7. H(x) = 1 if x = uu^R s.t. F(u) = G(u) = 1, and H(x) = 0 otherwise.

Exercise 10.2 Prove that the function F : {0,1}* → {0,1} such that
F(x) = 1 if and only if |x| is a power of two is not context free.


Exercise 10.3 — Syntax for programming languages. Consider the following
syntax of a “programming language” whose source can be written
using the ASCII character set:


• Variables are obtained by a sequence of letters, numbers and underscores,
but can’t start with a number.


• A statement has either the form foo = bar; where foo and bar are
variables, or the form IF (foo) BEGIN ... END where ... is a list
of one or more statements, potentially separated by newlines.


A program in our language is simply a sequence of statements (pos- 
sibly separated by newlines or spaces). 


1. Let VAR : {0,1}* → {0,1} be the function that given a string
x ∈ {0,1}*, outputs 1 if and only if x corresponds to an ASCII
encoding of a valid variable identifier. Prove that VAR is regular.


2. Let SYN : {0,1}* → {0,1} be the function that given a string
s ∈ {0,1}*, outputs 1 if and only if s is an ASCII encoding of a valid
program in our language. Prove that SYN is context free. (You do
not have to specify the full formal grammar for SYN, but you need
to show that such a grammar exists.)


3. Prove that SYN is not regular. See footnote for hint.²


10.6 BIBLIOGRAPHICAL NOTES 


As in the case of regular expressions, there are many resources avail- 
able that cover context-free grammar in great detail. Chapter 2 of 
[Sip97] contains many examples of context-free grammars and their 
properties. There are also websites such as Grammophone where you 
can input grammars, and see what strings they generate, as well as 
some of the properties that they satisfy. 

The adjective “context free” is used for CFG’s because a rule of
the form v ⇒ a means that we can always replace v with the string
a, no matter the context in which v appears. More generally,
we might want to consider cases where the replacement rules depend
on the context. This gives rise to the notion of general (aka “Type 0”)
grammars that allow rules of the form a ⇒ b where both a and b are
strings over (V ∪ Σ)*. The idea is that if, for example, we wanted to
enforce the condition that we only apply some rule such as v ⇒ 0w1
when v is surrounded by three zeroes on both sides, then we could do
so by adding a rule of the form 000v000 ⇒ 0000w1000 (and of course
we can add much more general conditions). Alas, this generality
comes at a cost: general grammars are Turing complete and hence
their halting problem is uncomputable. That is, there is no algorithm
A that can determine for every general grammar G and a string x,
whether or not the grammar G generates x.


2 Try to see if you can “embed” in some way a function
that looks similar to MATCHPAREN in SYN, so
you can use a similar proof. Of course for a function
to be non-regular, it does not need to utilize literal
parentheses symbols.

The Chomsky Hierarchy is a hierarchy of grammars from the least 
restrictive (most powerful) Type 0 grammars, which correspond to 
recursively enumerable languages (see Exercise 9.10) to the most re- 
strictive Type 3 grammars, which correspond to regular languages. 
Context-free languages correspond to Type 2 grammars. Type 1 gram- 
mars are context sensitive grammars. These are more powerful than 
context-free grammars but still less powerful than Turing machines. 
In particular functions/languages corresponding to context-sensitive
grammars are always computable, and in fact can be computed by
linear bounded automata, which are non-deterministic algorithms
that take O(n) space. For this reason, the class of functions/languages
corresponding to context-sensitive grammars is also known as the
complexity class NSPACE(O(n)) (we discuss space-bounded complexity
in Chapter 17). While Rice’s Theorem implies that we cannot
compute any non-trivial semantic property of Type 0 grammars, the
situation is more complex for other types of grammars: some semantic
properties can be determined and some cannot, depending on the
grammar’s place in the hierarchy.


Learning Objectives:


• More examples of uncomputable functions
that are not as tied to computation.


• Gödel’s incompleteness theorem - a result
that shook the world of mathematics in the
early 20th century.


11 
Is every theorem provable? 


“Take any definite unsolved problem, such as ... the existence of an infinite
number of prime numbers of the form 2^n + 1. However unapproachable these
problems may seem to us and however helpless we stand before them, we have,
nevertheless, the firm conviction that their solution must follow by a finite
number of purely logical processes...”

“...This conviction of the solvability of every mathematical problem is a pow- 
erful incentive to the worker. We hear within us the perpetual call: There is the 
problem. Seek its solution. You can find it by pure reason, for in mathematics 
there is no ignorabimus.”, David Hilbert, 1900. 


“The meaning of a statement is its method of verification.”, Moritz Schlick, 
1938 (aka “The verification principle” of logical positivism) 


The problems shown uncomputable in Chapter 9, while natural 
and important, still intimately involved NAND-TM programs or other 
computing mechanisms in their definitions. One could perhaps hope 
that as long as we steer clear of functions whose inputs are themselves 
programs, we can avoid the “curse of uncomputability”. Alas, we have 
no such luck. 

In this chapter we will see an example of a natural and seemingly
“computation free” problem that nevertheless turns out to be uncomputable:
solving Diophantine equations. As a corollary, we will see
one of the most striking results of 20th century mathematics: Gödel’s
Incompleteness Theorem, which showed that there are some mathematical
statements (in fact, in number theory) that are inherently unprovable.
We will actually start with the latter result, and then show the
former.


11.1 HILBERT’S PROGRAM AND GÖDEL’S INCOMPLETENESS THEOREM

“And what are these ...vanishing increments? They are neither finite quantities,
nor quantities infinitely small, nor yet nothing. May we not call them the
ghosts of departed quantities?”, George Berkeley, Bishop of Cloyne, 1734.


The 1700’s and 1800’s were a time of great discoveries in mathematics
but also of several crises. The discovery of calculus by Newton
and Leibniz in the late 1600’s ushered in a golden age of problem solving.
Many longstanding challenges succumbed to the new tools that
were discovered, and mathematicians got ever better at doing some
truly impressive calculations. However, the rigorous foundations
behind these calculations left much to be desired. Mathematicians
manipulated infinitesimal quantities and infinite series cavalierly, and
while most of the time they ended up with the correct results, there
were a few strange examples (such as trying to calculate the value
of the infinite series 1 − 1 + 1 − 1 + 1 + ⋯) which seemed to give
out different answers depending on the method of calculation. This
led to a growing sense of unease in the foundations of the subject
which was addressed in the works of mathematicians such as Cauchy,
Weierstrass, and Riemann, who eventually placed analysis on firmer
foundations, giving rise to the ε’s and δ’s that students taking honors
calculus grapple with to this day.


Figure 11.1: Outline of the results of this chapter. One
version of Gödel’s Incompleteness Theorem is an
immediate consequence of the uncomputability of the
Halting problem. To obtain the theorem as originally
stated (for statements about the integers) we first
prove that the QMS problem of determining truth
of quantified statements involving both integers and
strings is uncomputable. We do so using the notion of
Turing Machine configurations but there are alternative
approaches to do so as well, see Remark 11.14.

At the beginning of the 20th century, there was an attempt to replicate
this effort, with greater rigor, in all parts of mathematics. The hope was
to show that all the true results of mathematics can be obtained by
starting with a number of axioms, and deriving theorems from them
using logical rules of inference. This effort was known as the Hilbert
program, named after the influential mathematician David Hilbert.

Alas, it turns out the results we’ve seen dealt a devastating blow to
this program, as was shown by Kurt Gödel in 1931:


Theorem 11.1 — Gödel’s Incompleteness Theorem: informal version. For
every sound proof system V for sufficiently rich mathematical
statements, there is a mathematical statement that is true but is not
provable in V.


11.1.1 Defining “Proof Systems” 

Before proving Theorem 11.1, we need to define “proof systems” and 
even formally define the notion of a “mathematical statement”. In 
geometry and other areas of mathematics, proof systems are often 
defined by starting with some basic assumptions or axioms and then 
deriving more statements by using inference rules such as the famous 
Modus Ponens, but what axioms shall we use? What rules? We will 
use an extremely general notion of proof systems, not even restricting 
ourselves to ones that have the form of axioms and inference. 


Mathematical statements. At the highest level, a mathematical statement
is simply a piece of text, which we can think of as a string x ∈ {0,1}*.
Mathematical statements contain assertions whose truth does not
depend on any empirical fact, but rather only on properties of abstract
objects. For example, the following is a mathematical statement:¹


“The number 2,696,635,869,504,783,333,238,805,675,613,588,278,597,832,162,617,892,474,670,798,113
is prime”.


1 This happens to be a false statement.


Mathematical statements do not have to involve numbers. They
can assert properties of any other mathematical object including sets,
strings, functions, graphs and yes, even programs. Thus, another example
of a mathematical statement is the following:²


The following Python function halts on every positive integer n


    def f(n):
        if n==1: return 1
        return f(3*n+1) if n % 2 else f(n//2)


2 It is unknown whether this statement is true or false.


Proof systems. A proof for a statement x ∈ {0,1}* is another piece of
text w ∈ {0,1}* that certifies the truth of the statement asserted in x.
The conditions for a valid proof system are:


1. (Effectiveness) Given a statement x and a proof w, there is an algo- 
rithm to verify whether or not w is a valid proof for x. (For exam- 
ple, by going line by line and checking that each line follows from 
the preceding ones using one of the allowed inference rules.) 


2. (Soundness) If there is a valid proof w for x then x is true.


These are quite minimal requirements for a proof system. Requirement
2 (soundness) is the very definition of a proof system: you
shouldn’t be able to prove things that are not true. Requirement 1
is also essential. If there is no set of rules (i.e., an algorithm) to check
that a proof is valid then in what sense is it a proof system? We could
replace it with a system where the “proof” for a statement x is “trust
me: it’s true”.

We formally define a proof system as an algorithm V where
V(x, w) = 1 holds if the string w is a valid proof for the statement x.
Even if x is true, the string w does not have to be a valid proof for it
(there are plenty of wrong proofs for true statements such as 4=2+2)
but if w is a valid proof for x then x must be true.


Definition 11.2 — Proof systems. Let 𝒯 ⊆ {0,1}* be some set (which
we consider the “true” statements). A proof system for 𝒯 is an algorithm
V that satisfies:


1. (Effectiveness) For every x, w ∈ {0,1}*, V(x, w) halts with an output
of either 0 or 1.


2. (Soundness) For every x ∉ 𝒯 and w ∈ {0,1}*, V(x, w) = 0.


A true statement x ∈ 𝒯 is unprovable (with respect to V) if for
every w ∈ {0,1}*, V(x, w) = 0. We say that V is complete if there
does not exist a true statement x that is unprovable with respect to
V.
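
As a toy illustration of the definition, consider statements of the form “n is composite”, where a proof is simply a nontrivial divisor of n. The Python sketch below is an assumption-laden example and not part of the formal development: statements and proofs are written as decimal strings rather than binary ones, and the set 𝒯 of true statements is taken to be the composite numbers.

```python
def V(x, w):
    """Toy verifier: x encodes the statement "n is composite" (n in decimal),
    w encodes a purported proof, namely a divisor d of n (in decimal)."""
    if not (x.isdecimal() and w.isdecimal()):
        return 0  # malformed statement or proof: reject
    n, d = int(x), int(w)
    # Accept only a nontrivial divisor, so a false statement is never accepted.
    return 1 if 1 < d < n and n % d == 0 else 0


print(V("15", "3"))   # a valid proof that 15 is composite
print(V("15", "4"))   # a wrong "proof" of a true statement
print(V("13", "1"))   # 13 is prime, so no proof is accepted
```

The function always halts with 0 or 1 (effectiveness) and never accepts a proof for a number that is not composite (soundness). This particular toy system also happens to be complete, since every composite number has a nontrivial divisor.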


11.2 GODEL’S INCOMPLETENESS THEOREM: COMPUTATIONAL 
VARIANT 


Our first formalization of Theorem 11.1 involves statements about
Turing machines. We let ℋ be the set of true statements x ∈ {0,1}* that
have the form “Turing machine M halts on the zero input” or
“Turing machine M does not halt on the zero input”.


Theorem 11.3 — Gödel’s Incompleteness Theorem: computational variant.
There does not exist a complete proof system for ℋ.


Proof Idea: 

If we had such a complete and sound proof system then we could 
solve the HALTONZERO problem. On input a Turing machine M, 
we would search all purported proofs w and halt as soon as we find 
a proof of either “M halts on zero” or “M does not halt on zero”. If 
the system is sound and complete then we will eventually find such a 
proof, and it will provide us with the correct output. 

* 


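Proof of Theorem 11.3. Assume for the sake of contradiction that there
was such a proof system V. We will use V to build an algorithm A
that computes HALTONZERO, hence contradicting Theorem 9.9. Our
algorithm A will work as follows:

Algorithm A: on input a Turing machine M, go over n = 0, 1, 2, ... and,
for each n, over all strings w ∈ {0,1}^n. If V(“M halts on 0”, w) = 1
then halt and output 1. If V(“M does not halt on 0”, w) = 1 then halt
and output 0.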
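
The search that algorithm A performs (try all purported proofs w in order of increasing length n, for both “M halts on 0” and “M does not halt on 0”) can be sketched in Python as follows. The verifier toy_V and the table TOY_TRUTH are hypothetical stand-ins invented so that the sketch can run; the theorem says precisely that no genuine complete and sound V of this kind exists.

```python
from itertools import count, product


def all_strings(n):
    """All binary strings of length n."""
    return ("".join(bits) for bits in product("01", repeat=n))


def algorithm_A(V, M):
    """Search purported proofs by increasing length n.

    Returns 1 if a proof of "M halts on 0" is found first,
    and 0 if a proof of "M does not halt on 0" is found first.
    """
    for n in count(0):
        for w in all_strings(n):
            if V(("halts", M), w) == 1:
                return 1
            if V(("does not halt", M), w) == 1:
                return 0


# Hypothetical toy verifier over two named "machines" with known behaviour.
TOY_TRUTH = {"M1": True, "M2": False}


def toy_V(statement, w):
    kind, M = statement
    claim_is_true = (kind == "halts") == TOY_TRUTH[M]
    # In this toy system the only accepted proof of a true claim is "101".
    return 1 if claim_is_true and w == "101" else 0


print(algorithm_A(toy_V, "M1"), algorithm_A(toy_V, "M2"))
```

If V is sound, the first accepted proof can only certify the correct answer; if V is also complete, some proof is eventually reached, so the loop terminates.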


If M halts on 0 then under our assumption there exists w that
proves this fact, and so when Algorithm A reaches n = |w| we will
eventually find this w and output 1, unless we already halted before.
But we cannot halt before and output a wrong answer because
it would contradict the soundness of the proof system. Similarly, this
shows that if M does not halt on 0 then (since we assume there is a
proof of this fact too) our algorithm A will eventually halt and output
0.

∎


Remark 11.5 — The Gödel statement (optional). One can
extract from the proof of Theorem 11.3 a procedure
that for every proof system V, yields a true statement
x* that cannot be proven in V. But Gödel’s proof
gave a very explicit description of such a statement x*
which is closely related to the “Liar’s paradox”. That
is, Gödel’s statement x* was designed to be true if and
only if ∀_{w∈{0,1}*} V(x*, w) = 0. In other words, it satisfied
the following property


x* is true ⇔ x* does not have a proof in V     (11.1)


One can see that if x* is true, then it does not have a
proof, but if it is false then (assuming the proof system
is sound) it cannot have a proof either, which by (11.1)
would make it true. Hence x* must be both true and unprovable.
One might wonder how it is possible to come up with an x* that
satisfies a condition such as (11.1) where the same
string x* appears on both the right-hand side and the
left-hand side of the equivalence. The idea is that the
proof of Theorem 11.3 yields a way to transform every
statement x into a statement F(x) that is true if and
only if x does not have a proof in V. Thus x* needs to
be a fixed point of F: a sentence such that x* = F(x*).
It turns out that we can always find such a fixed point
of F. We’ve already seen this phenomenon in the λ
calculus, where the Y combinator maps every F into
a fixed point Y F of F. This is very related to the idea
of programs that can print their own code. Indeed,
Scott Aaronson likes to describe Gödel’s statement as
follows:


The following sentence repeated twice, the sec- 
ond time in quotes, is not provable in the formal 
system V. “The following sentence repeated 
twice, the second time in quotes, is not provable 
in the formal system V.” 


In the argument above we actually showed that x* is
true, under the assumption that V is sound. Since x*
is true and does not have a proof in V, this means that
we cannot carry out the above argument in the system V,
which means that V cannot prove its own soundness
(or even consistency: that there is no proof of both a
statement and its negation). Using this idea, it’s not
hard to get Gödel’s second incompleteness theorem,
which says that every sufficiently rich V cannot prove
its own consistency.
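
The remark’s comparison with programs that print their own code can be made concrete with a classic two-line Python “quine”. The sketch below is only an illustration of the self-reference trick (a template applied to a quoted copy of itself, much like the sentence “repeated twice, the second time in quotes”); the variable names are arbitrary.

```python
# `template` is a program with a hole (%r) where a quoted copy of itself goes.
template = 'template = %r\nprint(template %% template)'

# Filling the hole with the template itself gives the full source text
# of a two-line Python program.
program = template % template
print(program)
```

Saving the two printed lines as a file and running it prints exactly those two lines again, so the program is a fixed point of the operation “run and collect the output”.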


11.3 QUANTIFIED INTEGER STATEMENTS 


There is something “unsatisfying” about Theorem 11.3. Sure, it shows
there are statements that are unprovable, but they don’t feel like “real”
statements about math. After all, they talk about programs rather than
numbers, matrices, or derivatives, or whatever it is they teach in math
courses. It turns out that we can get an analogous result for statements
such as “there are no positive integers x and y such that x^2 − 2 =
y^7”, or “there are positive integers x, y, z such that x^2 + y^6 = z^11”
that only talk about natural numbers. It doesn’t get much more “real
math” than this. Indeed, the 19th century mathematician Leopold
Kronecker famously said that “God made the integers, all else is the
work of man.” (By the way, the status of the above two statements is
unknown.)

To make this more precise, let us define the notion of quantified 
integer statements: 


Definition 11.6 — Quantified integer statements. A quantified integer statement
is a well-formed statement with no unbound variables involving
integers, variables, the operators >, <, ×, +, −, =, the logical
operations ¬ (NOT), ∧ (AND), and ∨ (OR), as well as quantifiers
of the form ∃_{x∈ℕ} and ∀_{y∈ℕ} where x, y are variable names.


We often care deeply about determining the truth of quantified
integer statements. For example, the statement that Fermat’s Last
Theorem is true for n = 3 can be phrased as the quantified integer
statement


¬∃_{a∈ℕ} ∃_{b∈ℕ} ∃_{c∈ℕ} (a > 0) ∧ (b > 0) ∧ (c > 0) ∧ (a×a×a + b×b×b = c×c×c).


The twin prime conjecture, that states that there is an infinite number
of numbers p such that both p and p + 2 are primes, can be phrased
as the quantified integer statement


∀_{n∈ℕ} ∃_{p∈ℕ} (p > n) ∧ PRIME(p) ∧ PRIME(p + 2)


where we replace an instance of PRIME(q) with the statement
(q > 1) ∧ ∀_{a∈ℕ} ∀_{b∈ℕ} (a = 1) ∨ (a = q) ∨ ¬(a × b = q).

The claim (mentioned in Hilbert’s quote above) that there are infinitely
many primes of the form p = 2^n + 1 can be phrased as follows:


∀_{n∈ℕ} ∃_{p∈ℕ} (p > n) ∧ PRIME(p) ∧
(∀_{k∈ℕ} (k ≠ 2 ∧ PRIME(k)) ⇒ ¬DIVIDES(k, p − 1))          (11.2)


where DIVIDES(a, b) is the statement ∃_{c∈ℕ} a × c = b. In English, this
corresponds to the claim that for every n there is some p > n such that
all of p − 1’s prime factors are equal to 2.
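
To see how these formulas unfold, the predicates PRIME and DIVIDES, and the part of (11.2) that concerns a fixed p, can be evaluated directly in Python once the quantifiers are restricted to a finite range. The sketch below rests on an assumption that is not part of the definition: every quantifier is bounded (by q, b or p), which is harmless for these particular predicates but is not possible for the outer ∀_{n∈ℕ} ∃_{p∈ℕ}, which is exactly why the full statements cannot be checked this way. The function names are made up for this example.

```python
def prime(q):
    # PRIME(q): (q > 1) AND for all a, b: (a = 1) OR (a = q) OR NOT (a*b = q).
    # Bounding a, b by q is enough, since a*b = q > 1 forces 1 <= a, b <= q.
    return q > 1 and all(a == 1 or a == q or a * b != q
                         for a in range(q + 1) for b in range(q + 1))


def divides(a, b):
    # DIVIDES(a, b): there exists c with a*c = b; any such c satisfies c <= b.
    return any(a * c == b for c in range(b + 1))


def body_of_11_2(p):
    # For a fixed p: PRIME(p) AND for all k: (k != 2 AND PRIME(k)) => NOT DIVIDES(k, p-1).
    # Bounding k by p - 1 is enough, since a larger k cannot divide p - 1 >= 1.
    return prime(p) and all(not (k != 2 and prime(k)) or not divides(k, p - 1)
                            for k in range(p))


print([p for p in range(20) if body_of_11_2(p)])
```

The numbers printed are the primes p below 20 for which p − 1 has no prime factor other than 2, that is, the primes of the form 2^n + 1 in that range.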


Much of number theory is concerned with determining the truth
of quantified integer statements. Since our experience has been that,
given enough time (which could sometimes be several centuries) humanity
has managed to do so for the statements that it cared enough
about, one could (as Hilbert did) hope that eventually we would be
able to prove or disprove all such statements. Alas, this turns out to be
impossible:


Theorem 11.8 — Gödel’s Incompleteness Theorem for quantified integer statements.
Let V : {0,1}* → {0,1} be a computable purported verification
procedure for quantified integer statements. Then either:


• V is not sound: There exists a false statement x and a string
w ∈ {0,1}* such that V(x, w) = 1.


or


• V is not complete: There exists a true statement x such that for
every w ∈ {0,1}*, V(x, w) = 0.


Theorem 11.8 is a direct corollary of the following result, just 
as Theorem 11.3 was a direct corollary of the uncomputability of 
HALTONZERO: 


Theorem 11.9 — Uncomputability of quantified integer statements. Let
QIS : {0,1}* → {0,1} be the function that given a (string representation
of) a quantified integer statement outputs 1 if it is true
and 0 if it is false. Then QIS is uncomputable.


Since a quantified integer statement is simply a sequence of symbols,
we can easily represent it as a string. For simplicity we will assume
that every string represents some quantified integer statement,
by mapping strings that do not correspond to such a statement to an
arbitrary statement such as ∃_{x∈ℕ} x = 1.


In the rest of this chapter, we will show the proof of Theorem 11.8, 
following the outline illustrated in Fig. 11.1. 


11.4 DIOPHANTINE EQUATIONS AND THE MRDP THEOREM 


Many of the functions people wanted to compute over the years involved
solving equations. These have a much longer history than
mechanical computers. The Babylonians already knew how to solve
some quadratic equations in 2000 BC, and the formula for all quadratics
appears in the Bakhshali Manuscript that was composed in India
around the 3rd century. During the Renaissance, Italian mathematicians
discovered generalizations of these formulas for cubic and quartic
(degrees 3 and 4) equations. Many of the greatest minds of the
17th and 18th century, including Euler, Lagrange, Leibniz and Gauss
worked on the problem of finding such a formula for quintic equations
to no avail, until in the 19th century Ruffini, Abel and Galois showed
that no such formula exists, along the way giving birth to group theory.

However, the fact that there is no closed-form formula does
not mean we cannot solve such equations. People have been
solving higher degree equations numerically for ages. The Chinese
manuscript Jiuzhang Suanshu from the first century mentions such
approaches. Solving polynomial equations is by no means restricted
only to ancient history or to students’ homework. The gradient
descent method is the workhorse powering many of the machine
learning tools that have revolutionized Computer Science over the last
several years.

But there are some equations that we simply do not know how to
solve by any means. For example, it took more than 200 years until people
succeeded in proving that the equation a^11 + b^11 = c^11 has no
solution in positive integers.³ The notorious difficulty of so-called Diophantine
equations (i.e., finding integer roots of a polynomial) motivated the
mathematician David Hilbert in 1900 to include the question of finding
a general procedure for solving such equations in his famous list
of twenty-three open problems for mathematics of the 20th century. I
don’t think Hilbert doubted that such a procedure exists. After all, the
whole history of mathematics up to this point involved the discovery
of ever more powerful methods, and even impossibility results such
as the inability to trisect an angle with a straightedge and compass, or
the non-existence of an algebraic formula for quintic equations, merely
pointed to the need to use more general methods.

Alas, this turned out not to be the case for Diophantine equations. 
In 1970, Yuri Matiyasevich, building on a decades long line of work by 
Martin Davis, Hilary Putnam and Julia Robinson, showed that there is 
simply no method to solve such equations in general: 


Theorem 11.10 — MRDP Theorem. Let DIO : {0,1}* → {0,1} be the
function that takes as input a string describing a 100-variable polynomial
with integer coefficients P(x_0, ..., x_99) and outputs 1 if and
only if there exist x_0, ..., x_99 ∈ ℕ s.t. P(x_0, ..., x_99) = 0.

Then DIO is uncomputable.


As usual, we assume some standard way to express numbers and
text as binary strings. The constant 100 is of course arbitrary; the problem
is known to be uncomputable even for polynomials of degree
four and at most 58 variables. In fact the number of variables can be
reduced to nine, at the expense of the polynomial having a larger (but
still constant) degree. See Jones’s paper for more about this issue.
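
It may help to see what the obvious approach to DIO looks like and where it falls short. The Python sketch below simply tries all tuples of natural numbers up to a given bound; the function name find_solution, the representation of a polynomial as a Python function, and the bound parameter are all assumptions made for this illustration.

```python
from itertools import product


def find_solution(P, num_vars, bound):
    """Try every tuple in {0, ..., bound}^num_vars and return the first root
    of P that is found, or None if there is no root within the bound."""
    for xs in product(range(bound + 1), repeat=num_vars):
        if P(*xs) == 0:
            return xs
    return None


# x^2 + y^2 - 25 = 0 has solutions in natural numbers; x^2 - 2 = 0 has none.
print(find_solution(lambda x, y: x * x + y * y - 25, 2, 5))
print(find_solution(lambda x: x * x - 2, 1, 100))
```

If a solution exists, repeating the search with ever larger bounds will eventually find it. A return value of None only says that no solution lies within the chosen bound, and Theorem 11.10 implies that no algorithm can replace it with a guaranteed “no solution exists” answer for all polynomials.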


Figure 11.2: Diophantine equations such as finding
a positive integer solution to the equation
a(a + b)(a + c) + b(b + a)(b + c) + c(c + a)(c + b) =
4(a + b)(a + c)(b + c) (depicted more compactly
and whimsically in the original image, a puzzle captioned
“95% of people cannot solve this!”) can be surprisingly difficult.
There are many equations for which we do not know
if they have a solution, and there is no algorithm to
solve them in general. The smallest solution for this
equation has 80 digits! See this Quora post for more
information, including the credits for this image.


3 This is a special case of what’s known as “Fermat’s
Last Theorem”, which states that $a^n + b^n = c^n$ has no
solution in positive integers for $n > 2$. This was conjectured in
1637 by Pierre de Fermat but only proven by Andrew
Wiles in 1994. The case $n = 11$ (along with all other
so-called “regular prime exponents”) was established
by Kummer in 1850.


11.5 HARDNESS OF QUANTIFIED INTEGER STATEMENTS 


We will not prove the MRDP Theorem (Theorem 11.10). However, as
we mentioned, we will prove the uncomputability of QIS (i.e., Theorem 11.9),
which is a weaker result that is implied by the MRDP Theorem. The reason
is that a Diophantine equation is a special case of a quantified integer
statement where the only quantifier is $\exists$. This means that deciding the
truth of quantified integer statements is a potentially harder problem
than solving Diophantine equations, and so it is potentially easier to
prove that QIS is uncomputable.


Our proof of the uncomputability of QIS (i.e. Theorem 11.9) will, as 
usual, go by reduction from the Halting problem, but we will do so in 
two steps: 


1. We will first use a reduction from the Halting problem to show that 
deciding the truth of quantified mixed statements is uncomputable. 
Quantified mixed statements involve both strings and integers. 
Since quantified mixed statements are a more general concept than 
quantified integer statements, it is easier to prove the uncomputabil- 
ity of deciding their truth. 


2. We will then reduce the problem of quantified mixed statements to 
quantified integer statements. 


11.5.1 Step 1: Quantified mixed statements and computation histories 
We define quantified mixed statements as statements involving not just
integers and the usual arithmetic operators, but also string variables.


Definition 11.12 — Quantified mixed statements. A quantified mixed statement
is a well-formed statement with no unbound variables involving
integers, variables, the operators $>, <, \times, +, -, =$, the logical
operations $\neg$ (NOT), $\wedge$ (AND), and $\vee$ (OR), as well as quantifiers
of the form $\exists_{x \in \mathbb{N}}$, $\exists_{a \in \{0,1\}^*}$, $\forall_{y \in \mathbb{N}}$,
$\forall_{b \in \{0,1\}^*}$ where $x, y, a, b$ are
variable names. These also include the operator $|a|$ which returns
the length of a string-valued variable $a$, as well as the operator $a_i$
where $a$ is a string-valued variable and $i$ is an integer-valued expression,
which is true if $i$ is smaller than the length of $a$ and the $i$-th
coordinate of $a$ is 1, and is false otherwise.


For example, the true statement that for every string a there is a 
string b that corresponds to a in reverse order can be phrased as the 
following quantified mixed statement 


$\forall_{a \in \{0,1\}^*} \exists_{b \in \{0,1\}^*} (|a| = |b|) \wedge (\forall_{i \in \mathbb{N}} \; i < |a| \Rightarrow (a_i \Leftrightarrow b_{|a|-1-i}))$ .

Quantified mixed statements are more general than quantified
integer statements, and so the following theorem is potentially easier
to prove than Theorem 11.9:


Theorem 11.13 — Uncomputability of quantified mixed statements. Let
$QMS : \{0,1\}^* \rightarrow \{0,1\}$ be the function that given a (string
representation of) a quantified mixed statement outputs 1 if it is true
and 0 if it is false. Then QMS is uncomputable.
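
To get a feel for the semantics of Definition 11.12, the string-reversal example can be checked by brute force on short strings. The following is only a sketch: the helper names (bit, is_reverse, check_reversal_statement) are made up for illustration, coordinates are 0-indexed so that coordinate i of a is compared with coordinate |a|-1-i of b, and both string quantifiers are truncated to strings of bounded length, which of course no program can avoid when the real quantifiers range over all of {0,1}*.

```python
from itertools import product

def bit(a, i):
    """The operator a_i: true iff i is smaller than |a| and coordinate i of a is 1."""
    return 0 <= i < len(a) and a[i] == "1"

def is_reverse(a, b):
    """Body of the statement: |a| = |b| and, for every i < |a|, a_i <=> b_{|a|-1-i}."""
    return len(a) == len(b) and all(
        bit(a, i) == bit(b, len(a) - 1 - i) for i in range(len(a))
    )

def strings_up_to(max_len):
    """All binary strings of length at most max_len."""
    for length in range(max_len + 1):
        for bits in product("01", repeat=length):
            yield "".join(bits)

def check_reversal_statement(max_len):
    """'For every a there exists b ...' with both quantifiers truncated to short strings."""
    pool = list(strings_up_to(max_len))
    return all(any(is_reverse(a, b) for b in pool) for a in pool)
```

Such a bounded check can only ever give evidence for a statement; Theorem 11.13 says that no algorithm decides the truth of these statements in general.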


Proof Idea:

The idea behind the proof is similar to that used in showing that
one-dimensional cellular automata are Turing complete (Theorem 8.7)
as well as showing that equivalence (or even “fullness”) of context
free grammars is uncomputable (Theorem 10.10). We use the notion
of a configuration of a NAND-TM program as in Definition 8.8. Such
a configuration can be thought of as a string $\alpha$ over some large-but-finite
alphabet $\Sigma$ describing its current state, including the values
of all arrays, scalars, and the index variable i. It can be shown that
if $\alpha$ is the configuration at a certain step of the execution and $\beta$ is
the configuration at the next step, then $\beta_j = \alpha_j$ for all $j$ outside of
$\{i-1, i, i+1\}$ where $i$ is the value of i. In particular, every value $\beta_j$ is
simply a function of $\alpha_{j-1}, \alpha_j, \alpha_{j+1}$. Using these observations we can write
a quantified mixed statement $NEXT(\alpha, \beta)$ that will be true if and only if
$\beta$ is the configuration encoding the next step after $\alpha$. Since a program
$P$ halts on input $x$ if and only if there is a sequence of configurations
$\alpha^0, \ldots, \alpha^{t-1}$ (known as a computation history) starting with the initial
configuration with input $x$ and ending in a halting configuration, we
can define a quantified mixed statement to determine if there is such
a sequence by taking an existential quantifier over all strings $H$ (for
history) that encode a tuple $(\alpha^0, \alpha^1, \ldots, \alpha^{t-1})$ and then checking that
$\alpha^0$ and $\alpha^{t-1}$ are valid starting and halting configurations, and that
$NEXT(\alpha^j, \alpha^{j+1})$ is true for every $j \in \{0, \ldots, t-2\}$.

* 


Proof of Theorem 11.13. The proof is obtained by a reduction from the
Halting problem. Specifically, we will use the notion of a configuration
of a Turing machine (Definition 8.8) that we have seen in the
context of proving that one-dimensional cellular automata are Turing
complete. We need the following facts about configurations:


• For every Turing machine $M$, there is a finite alphabet $\Sigma$, and a
configuration of $M$ is a string $\alpha \in \Sigma^*$.


• A configuration $\alpha$ encodes all the state of the computation at a particular
iteration, including the array, scalar, and index variables.


• If $\alpha$ is a configuration, then $\beta = NEXT_M(\alpha)$ denotes the configuration
of the computation after one more iteration. $\beta$ is a string over $\Sigma$
of length either $|\alpha|$ or $|\alpha| + 1$, and every coordinate of $\beta$ is a function
of just three coordinates in $\alpha$. That is, for every $j \in \{0, \ldots, |\beta| - 1\}$,
$\beta_j = MAP_M(\alpha_{j-1}, \alpha_j, \alpha_{j+1})$ where $MAP_M : \Sigma^3 \rightarrow \Sigma$ is some function
depending on $M$.


• There are simple conditions to check whether a string $\alpha$ is a valid
starting configuration corresponding to an input $x$, as well as to
check whether a string $\alpha$ is a halting configuration. In particular
these conditions can be phrased as quantified mixed statements.


• A machine $M$ halts on input $x$ if and only if there exists a sequence
of configurations $H = (\alpha^0, \alpha^1, \ldots, \alpha^{T-1})$ such that (i) $\alpha^0$ is a valid
starting configuration of $M$ with input $x$, (ii) $\alpha^{T-1}$ is a valid halting
configuration of $M$, and (iii) $\alpha^{i+1} = NEXT_M(\alpha^i)$ for every
$i \in \{0, \ldots, T-2\}$.
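
The locality fact in the list above (each coordinate of the next configuration depends on only three coordinates of the current one) can be illustrated with a small Python sketch. Everything here is an assumption made for illustration rather than the construction of Definition 8.8: the toy machine (it flips bits while moving right and halts at the first blank), the representation of a cell as a pair (tape symbol, state or None), and the names map_cell, next_config and history. The sketch also assumes the head never moves left off the first cell.

```python
BLANK = "_"
HALT = "halt"

# Hypothetical toy machine: (state, symbol) -> (new state, written symbol, move).
DELTA = {
    ("flip", "0"): ("flip", "1", "R"),
    ("flip", "1"): ("flip", "0", "R"),
    ("flip", BLANK): (HALT, BLANK, "S"),
}

def map_cell(left, mid, right):
    """Plays the role of MAP_M: the next value of a cell from its 3-cell neighborhood."""
    sym, state = mid
    if state is not None:                 # the head is on this cell
        if state == HALT:
            return mid                    # a halting configuration does not change
        new_state, new_sym, move = DELTA[(state, sym)]
        return (new_sym, new_state if move == "S" else None)
    # The head is elsewhere: the symbol is unchanged, but the head may arrive here.
    for (n_sym, n_state), arriving_move in ((left, "R"), (right, "L")):
        if n_state is not None and n_state != HALT:
            new_state, _, move = DELTA[(n_state, n_sym)]
            if move == arriving_move:
                return (sym, new_state)
    return mid

def next_config(alpha):
    """Plays the role of NEXT_M: apply map_cell at every position; length grows by at most 1."""
    pad = (BLANK, None)
    padded = [pad] + list(alpha) + [pad, pad]
    beta = [map_cell(padded[j], padded[j + 1], padded[j + 2])
            for j in range(len(alpha) + 1)]
    if beta[-1] == pad:                   # the head did not move past the old end
        beta.pop()
    return beta

def history(x):
    """Computation history alpha^0, ..., alpha^(T-1) of the toy machine on input x."""
    alpha = [(c, None) for c in x] or [(BLANK, None)]
    alpha[0] = (alpha[0][0], "flip")      # head starts on the first cell
    H = [alpha]
    while not any(state == HALT for _, state in H[-1]):
        H.append(next_config(H[-1]))
    return H
```

Because next_config only ever looks at three adjacent cells, checking that one configuration follows another is a purely local test, which is what makes it expressible with quantifiers over positions.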


We can encode such a sequence $H$ of configurations as a binary
string. For concreteness, we let $\ell = \lceil \log(|\Sigma| + 1) \rceil$ and encode each
symbol $\sigma$ in $\Sigma \cup \{ ";" \}$ by a string in $\{0,1\}^\ell$. We use “;” as a “separator”
symbol, and so encode $H = (\alpha^0, \alpha^1, \ldots, \alpha^{T-1})$ as the concatenation
of the encodings of each configuration, using “;” to separate the encoding
of $\alpha^i$ and $\alpha^{i+1}$ for every $i \in \{0, \ldots, T-2\}$. In particular for every Turing
machine $M$, $M$ halts on the input 0 if and only if the following statement
$\varphi_M$ is true


$\exists_{H \in \{0,1\}^*}$ H encodes halting configuration sequence starting with input 0 .
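
The binary encoding of a history just described can be sketched as follows. The function names (make_codebook, encode_history, decode_history) and the particular assignment of codewords to symbols are illustrative assumptions; any fixed injective assignment of $\ell$-bit strings to the symbols of $\Sigma \cup \{ ";" \}$ would do, and the sketch assumes “;” is not itself a symbol of the alphabet.

```python
from math import ceil, log2

def make_codebook(sigma):
    """Assign each symbol of sigma, plus the separator ';', a distinct ell-bit string."""
    symbols = list(sigma) + [";"]
    ell = max(1, ceil(log2(len(symbols))))   # ell = ceil(log(|Sigma| + 1))
    code = {s: format(k, f"0{ell}b") for k, s in enumerate(symbols)}
    return code, ell

def encode_history(history, code):
    """Concatenate the encoded configurations, separated by the encoding of ';'."""
    parts = ["".join(code[s] for s in alpha) for alpha in history]
    return code[";"].join(parts)

def decode_history(bits, code, ell):
    """Inverse of encode_history: cut into ell-bit blocks and split at separators."""
    inverse = {v: k for k, v in code.items()}
    history, current = [], []
    for i in range(0, len(bits), ell):
        symbol = inverse[bits[i:i + ell]]
        if symbol == ";":
            history.append(current)
            current = []
        else:
            current.append(symbol)
    history.append(current)
    return history
```

Since every symbol occupies exactly $\ell$ bits, symbol boundaries are the positions that are multiples of $\ell$, which is the fact the NEXT predicate in the proof relies on.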


If we can encode the statement $\varphi_M$ as a quantified mixed statement
then, since $\varphi_M$ is true if and only if $HALTONZERO(M) = 1$, this
would reduce the task of computing HALTONZERO to computing
QMS, and hence imply (using Theorem 9.9) that QMS is uncomputable,
completing the proof. Indeed, $\varphi_M$ can be encoded as a quantified
mixed statement for the following reasons:


1. Let $\alpha, \beta \in \{0,1\}^*$ be two strings that encode configurations of $M$.
We can define a quantified mixed predicate $NEXT(\alpha, \beta)$ that is true
if and only if $\beta = NEXT_M(\alpha)$ (i.e., $\beta$ encodes the configuration
obtained by proceeding from $\alpha$ in one computational step). Indeed
$NEXT(\alpha, \beta)$ is true if for every $i \in \{0, \ldots, |\beta|\}$ which is a multiple
of $\ell$, $\beta_{i, \ldots, i+\ell-1} = MAP_M(\alpha_{i-\ell, \ldots, i+2\ell-1})$ where $MAP_M : \{0,1\}^{3\ell} \rightarrow
\{0,1\}^{\ell}$ is the finite function above (identifying elements of $\Sigma$ with
their encoding in $\{0,1\}^\ell$). Since $MAP_M$ is a finite function, we can
express it using the logical operations AND, OR, NOT (for example
by computing $MAP_M$ with NAND’s).


2. Using the above we can now write the condition that for every
substring of $H$ that has the form $\alpha \, ENC(;) \, \beta$ with $\alpha, \beta \in \{0,1\}^*$
and $ENC(;)$ being the encoding of the separator “;”, it holds that
$NEXT(\alpha, \beta)$ is true.


3. Finally, if $\alpha^0$ is a binary string encoding the initial configuration of
$M$ on input 0, checking that the first $|\alpha^0|$ bits of $H$ equal $\alpha^0$ can be
expressed using AND, OR, and NOT’s. Similarly checking that the
last configuration encoded by $H$ corresponds to a state in which $M$
will halt can also be expressed as a quantified statement.


Together the above yields a computable procedure that maps every
Turing machine $M$ into a quantified mixed statement $\varphi_M$ such that
$HALTONZERO(M) = 1$ if and only if $QMS(\varphi_M) = 1$. This reduces
computing HALTONZERO to computing QMS, and hence the uncomputability
of HALTONZERO implies the uncomputability of QMS.

∎


11.5.2 Step 2: Reducing mixed statements to integer statements 
We now show how to prove Theorem 11.9 using Theorem 11.13. The
idea is again a proof by reduction. We will show a transformation of
every quantified mixed statement $\varphi$ into a quantified integer statement
$\xi$ that does not use string-valued variables such that $\varphi$ is true if and
only if $\xi$ is true.

To remove string-valued variables from a statement, we encode
every string by a pair of integers. We will show that we can encode a
string $x \in \{0,1\}^*$ by a pair of numbers $(X, n) \in \mathbb{N}^2$ s.t.


• $n = |x|$


• There is a quantified integer statement $COORD(X, i)$ that for every
$i < n$, will be true if $x_i = 1$ and will be false otherwise.


This will mean that we can replace a “for all” quantifier over strings
such as $\forall_{x \in \{0,1\}^*}$ with a pair of quantifiers over integers of the form
$\forall_{X \in \mathbb{N}} \forall_{n \in \mathbb{N}}$ (and similarly replace an existential quantifier of the form
$\exists_{x \in \{0,1\}^*}$ with a pair of quantifiers $\exists_{X \in \mathbb{N}} \exists_{n \in \mathbb{N}}$). We can then replace all
calls to $|x|$ by $n$ and all calls to $x_i$ by $COORD(X, i)$. This means that
if we are able to define COORD via a quantified integer statement,
then we obtain a proof of Theorem 11.9, since we can use it to map
every mixed quantified statement $\varphi$ to an equivalent quantified integer
statement $\xi$ such that $\xi$ is true if and only if $\varphi$ is true, and hence
$QMS(\varphi) = QIS(\xi)$. Such a procedure implies that the task of computing
QMS reduces to the task of computing QIS, which means that the
uncomputability of QMS implies the uncomputability of QIS.

The above shows that the proof of Theorem 11.9 all boils down to finding
the right encoding of strings as integers, and the right way to
implement COORD as a quantified integer statement. To achieve this
we use the following technical result:


Lemma 11.15 — Constructible prime sequence. There is a sequence of prime
numbers $p_0 < p_1 < p_2 < \cdots$ such that there is a quantified integer
statement $PSEQ(p, i)$ that is true if and only if $p = p_i$.


Using Lemma 11.15 we can encode a string $x \in \{0,1\}^*$ by the numbers
$(X, n)$ where $X = \prod_{x_i = 1} p_i$ and $n = |x|$. We can then define the
statement $COORD(X, i)$ as


$COORD(X, i) = \exists_{p \in \mathbb{N}} PSEQ(p, i) \wedge DIVIDES(p, X)$


where $DIVIDES(a, b)$, as before, is defined as $\exists_{c \in \mathbb{N}} \, a \times c = b$. Note that
indeed if $X, n$ encodes the string $x \in \{0,1\}^*$, then for every $i < n$,
$COORD(X, i) = x_i$ since $p_i$ divides $X$ if and only if $x_i = 1$.
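
The encoding of a string by the pair $(X, n)$ can be tried out on small examples. This is a minimal sketch under a loudly stated assumption: instead of the sequence of Lemma 11.15 it uses the ordinary primes 2, 3, 5, ... as $p_0, p_1, p_2, \ldots$, which is enough to see how COORD recovers each bit by a divisibility test. The function names are chosen for illustration.

```python
def first_primes(count):
    """The first `count` ordinary primes (stand-ins for p_0, p_1, ... of the lemma)."""
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def encode(x):
    """Map a binary string x to (X, n): X is the product of p_i over all i with x_i = 1."""
    X = 1
    for p, bit in zip(first_primes(len(x)), x):
        if bit == "1":
            X *= p
    return X, len(x)

def divides(a, b):
    """DIVIDES(a, b): there exists c with a * c = b."""
    return b % a == 0

def coord(X, i, n):
    """COORD(X, i) for i < n: true iff p_i divides X, i.e. iff x_i = 1."""
    return divides(first_primes(n)[i], X)

def decode(X, n):
    """Recover the string from (X, n) using only COORD."""
    return "".join("1" if coord(X, i, n) else "0" for i in range(n))
```

For instance, under this assumption the string 1011 is encoded as X = 2 · 5 · 7 = 70 with n = 4. The length n must be carried separately because trailing zeros do not change X.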

Thus all that is left to conclude the proof of Theorem 11.9 is to 
prove Lemma 11.15, which we now proceed to do. 


Proof. The sequence of prime numbers we consider is the following:
We fix $C$ to be a sufficiently large constant ($C = 2^{2^{34}}$ will do) and
define $p_i$ to be the smallest prime number that is in the interval
$[(i+C)^3 + 1, (i+C+1)^3 - 1]$. It is known that there exists such a prime
number for every $i \in \mathbb{N}$. Given this, the definition of $PSEQ(p, i)$ is
simple:


$(p > (i+C) \times (i+C) \times (i+C)) \wedge (p < (i+C+1) \times (i+C+1) \times (i+C+1)) \wedge PRIME(p) \wedge \left( \forall_{p'} \neg PRIME(p') \vee (p' \leq (i+C) \times (i+C) \times (i+C)) \vee (p' \geq p) \right)$


We leave it to the reader to verify that $PSEQ(p, i)$ is true iff $p = p_i$.
∎
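
The definition of the prime sequence can be explored numerically, with one important caveat: the sketch below uses a tiny offset C = 1 purely as an assumption for illustration. The existence of a prime between consecutive cubes is only guaranteed by the cited result for sufficiently large numbers, so for small values the code simply searches and raises an error if nothing is found. The names is_prime, p_seq and PSEQ (as a Python function) are illustrative.

```python
from math import isqrt

def is_prime(p):
    """Trial-division primality test."""
    return p >= 2 and all(p % d for d in range(2, isqrt(p) + 1))

def p_seq(i, C=1):
    """Smallest prime in the interval [(i+C)^3 + 1, (i+C+1)^3 - 1]."""
    lo, hi = (i + C) ** 3 + 1, (i + C + 1) ** 3 - 1
    for p in range(lo, hi + 1):
        if is_prime(p):
            return p
    raise ValueError("no prime found between these consecutive cubes")

def PSEQ(p, i, C=1):
    """Direct transcription of the predicate: p is a prime strictly between the two
    cubes, and every prime p' is either at most the lower cube or at least p."""
    lower, upper = (i + C) ** 3, (i + C + 1) ** 3
    in_range = lower < p < upper
    # Any p' >= upper satisfies p' >= p when in_range holds, so a bounded loop suffices.
    minimal = all(not is_prime(q) or q <= lower or q >= p for q in range(2, upper))
    return in_range and is_prime(p) and minimal
```

With C = 1 the first few values are 2, 11, 29 and 67. Since the intervals for different i are disjoint and increasing, the resulting sequence is strictly increasing, as Lemma 11.15 requires.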


To sum up we have shown that for every quantified mixed statement
$\varphi$, we can compute a quantified integer statement $\xi$ such that
$QMS(\varphi) = 1$ if and only if $QIS(\xi) = 1$. Hence the uncomputability
of QMS (Theorem 11.13) implies the uncomputability of QIS, completing
the proof of Theorem 11.9, and so also the proof of Gödel’s
Incompleteness Theorem for quantified integer statements (Theorem 11.8).


11.6 EXERCISES 


Exercise 11.1 — Gödel’s Theorem from uncomputability of QIS. Prove Theorem 11.8
using Theorem 11.9.


Exercise 11.2 — Proof systems and uncomputability. Let $FINDPROOF :
\{0,1\}^* \rightarrow \{0,1\}$ be the following function. On input a Turing machine
$V$ (which we think of as the verifying algorithm for a proof system)
and a string $x \in \{0,1\}^*$, $FINDPROOF(V, x) = 1$ if and only if there
exists $w \in \{0,1\}^*$ such that $V(x, w) = 1$.


1. Prove that FINDPROOF is uncomputable.


2. Prove that there exists a Turing machine $V$ such that $V$ halts
on every input $x, w$ but the function $FINDPROOF_V$ defined as
$FINDPROOF_V(x) = FINDPROOF(V, x)$ is uncomputable. See
footnote for hint.4


4 Hint: think of $x$ as saying “Turing machine $M$ halts
on input $u$” and $w$ being a proof that is the number of
steps that it will take for this to happen. Can you find
an always-halting $V$ that will verify such statements?


Exercise 11.3 — Expression for floor. Let $FSQRT(n, m) = \forall_{j \in \mathbb{N}} ((j \times j) >
m) \vee (j \leq n)$. Prove that $FSQRT(n, m)$ is true if and only if $n = \lfloor \sqrt{m} \rfloor$.


Exercise 11.4 — Axiomatic proof systems. For every representation of logical
statements as strings, we can define an axiomatic proof system to
consist of a finite set of strings $A$ and a finite set of rules $I_0, \ldots, I_{m-1}$
with $I_j : (\{0,1\}^*)^{k_j} \rightarrow \{0,1\}^*$ such that a proof $(s_1, \ldots, s_n)$ that $s_n$
is true is valid if for every $i$, either $s_i \in A$ or there is some $j \in [m]$ and
there are $i_1, \ldots, i_{k_j} < i$ such that $s_i = I_j(s_{i_1}, \ldots, s_{i_{k_j}})$. A system is sound if
there is no false $s$ such that there is a proof that $s$ is true.
Prove that for every uncomputable function $F : \{0,1\}^* \rightarrow \{0,1\}$
and every sound axiomatic proof system $S$ (that is characterized by a
finite number of axioms and inference rules), there is some input $x$ for
which the proof system $S$ is able to prove neither that $F(x) = 0$
nor that $F(x) \neq 0$.


Exercise 11.5 — Post Correspondence Problem. In the Post Correspondence
Problem the input is a set $S = \{(\alpha^0, \beta^0), \ldots, (\alpha^{c-1}, \beta^{c-1})\}$ where each
$\alpha^i$ and $\beta^i$ is a string in $\{0,1\}^*$. We say that $PCP(S) = 1$ if and only if
there exists a list $(\alpha_0, \beta_0), \ldots, (\alpha_{m-1}, \beta_{m-1})$ of pairs in $S$ such that


$\alpha_0 \alpha_1 \cdots \alpha_{m-1} = \beta_0 \beta_1 \cdots \beta_{m-1}$ .


(We can think of each pair $(\alpha, \beta) \in S$ as a “domino tile” and the question
is whether we can stack a list of such tiles so that the top and the
bottom yield the same string.) It can be shown that the PCP is uncomputable
by a fairly straightforward though somewhat tedious proof
(see for example the Wikipedia page for the Post Correspondence
Problem or Section 5.2 in [Sip97]).

Use this fact to provide a direct proof that QMS is uncomputable by
showing that there exists a computable map $R : \{0,1\}^* \rightarrow \{0,1\}^*$ such
that $PCP(S) = QMS(R(S))$ for every string $S$ encoding an instance of
the Post Correspondence Problem.


Exercise 11.6 — Uncomputability of puzzle. Let $PUZZLE : \{0,1\}^* \rightarrow \{0,1\}$ be
the problem of determining, given a finite collection of types of “puzzle
pieces”, whether it is possible to put them together in a rectangle,
see Fig. 11.3. Formally, we think of such a collection as a finite set $\Sigma$
(see Fig. 11.3). We model the criteria as to which pieces “fit together”
by a pair of finite functions $match_{\updownarrow}, match_{\leftrightarrow} : \Sigma^2 \rightarrow \{0,1\}$ such that a
piece $a$ fits above a piece $b$ if and only if $match_{\updownarrow}(a, b) = 1$ and a piece
$c$ fits to the left of a piece $d$ if and only if $match_{\leftrightarrow}(c, d) = 1$. To model
the “straight edge” pieces that can be placed next to a “blank spot”
we assume that $\Sigma$ contains the symbol $\varnothing$ and the matching functions
are defined accordingly. A square tiling of $\Sigma$ is an $m \times n$ long string
$x \in \Sigma^{mn}$, such that for every $i \in \{1, \ldots, m-2\}$ and $j \in \{1, \ldots, n-2\}$,
$match(x_{i,j}, x_{i+1,j}, x_{i,j+1}, x_{i-1,j}, x_{i,j-1}) = 1$ (i.e., every “internal piece”
fits in with the pieces adjacent to it). We also require all of the “outer
pieces” (i.e., $x_{i,j}$ where $i \in \{0, m-1\}$ or $j \in \{0, n-1\}$) are “blank”
or equal to $\varnothing$. The function PUZZLE takes as input a string describing
the set $\Sigma$ and the function match and outputs 1 if and only if there is
some square tiling of $\Sigma$: some not all blank string $x \in \Sigma^{mn}$ satisfying
the above condition.


1. Prove that PUZZLE is uncomputable.


2. Give a reduction from PUZZLE to QMS.


Figure 11.3: In the puzzle problem, the input can be
thought of as a finite collection $\Sigma$ of types of puzzle
pieces and the goal is to find out whether or not there
is a way to arrange pieces from these types in a
rectangle. Formally, we model the input as a pair of
functions $match_{\leftrightarrow}, match_{\updownarrow} : \Sigma^2 \rightarrow \{0,1\}$
such that $match_{\leftrightarrow}(left, right) = 1$ (respectively
$match_{\updownarrow}(up, down) = 1$) if the pair of pieces are
compatible when placed in their respective positions.
We assume $\Sigma$ contains a special symbol $\varnothing$
corresponding to having no piece, and an arrangement
of puzzle pieces by an $(m-2) \times (n-2)$
rectangle is modeled by a string $x \in \Sigma^{m \cdot n}$ whose
“outer coordinates” are $\varnothing$ and such that for every
$i \in [n-1]$, $j \in [m-1]$, $match_{\updownarrow}(x_{i,j}, x_{i+1,j}) = 1$ and
$match_{\leftrightarrow}(x_{i,j}, x_{i,j+1}) = 1$.


Exercise 11.7 — MRDP exercise. The MRDP theorem states that the
problem of determining, given a $k$-variable polynomial $p$ with integer
coefficients, whether there exist integers $x_0, \ldots, x_{k-1}$ such that
$p(x_0, \ldots, x_{k-1}) = 0$ is uncomputable. Consider the following quadratic
integer equation problem: the input is a list of polynomials $p_0, \ldots, p_{m-1}$
over $k$ variables with integer coefficients, where each of the polynomials
is of degree at most two (i.e., it is a quadratic function). The goal
is to determine whether there exist integers $x_0, \ldots, x_{k-1}$ that solve the
equations $p_0(x) = \cdots = p_{m-1}(x) = 0$.

Use the MRDP Theorem to prove that this problem is uncomputable.
That is, show that the function $QUADINTEQ : \{0,1\}^* \rightarrow
\{0,1\}$ is uncomputable, where this function gets as input a string
describing the polynomials $p_0, \ldots, p_{m-1}$ (each with integer coefficients
and degree at most two), and outputs 1 if and only if there exist
$x_0, \ldots, x_{k-1} \in \mathbb{Z}$ such that for every $i \in [m]$, $p_i(x_0, \ldots, x_{k-1}) = 0$. See
footnote for hint.5


5 You can replace the equation $y = x^4$ with the pair
of equations $y = z^2$ and $z = x^2$. Also, you can
replace the equation $w = x^6$ with the three equations
$w = yu$, $y = x^4$ and $u = x^2$.


Exercise 11.8 — The Busy Beaver problem. In this question we define the
NAND-TM variant of the busy beaver function.


1. We define the function $T : \{0,1\}^* \rightarrow \mathbb{N}$ as follows: for every
string $P \in \{0,1\}^*$, if $P$ represents a NAND-TM program such that
when $P$ is executed on the input 0 (i.e., the string of length 1 that is
simply 0), a total of $M$ lines are executed before the program halts,
then $T(P) = M$. Otherwise (if $P$ does not represent a NAND-TM
program, or it is a program that does not halt on 0), $T(P) = 0$.
Prove that $T$ is uncomputable.


2. Let $TOWER(n)$ denote the number $\underbrace{2^{2^{\cdot^{\cdot^{\cdot^{2}}}}}}_{n \text{ times}}$
(that is, a “tower of powers of two” of height $n$). To get a sense of how
fast this function grows, $TOWER(1) = 2$, $TOWER(2) = 2^2 = 4$,
$TOWER(3) = 2^{2^2} = 16$, $TOWER(4) = 2^{16} = 65536$ and
$TOWER(5) = 2^{65536}$ which is about $10^{20000}$. $TOWER(6)$ is already a
number that is too big to write even in scientific notation. Define
$NBB : \mathbb{N} \rightarrow \mathbb{N}$ (for “NAND-TM Busy Beaver”) to be the function
$NBB(n) = \max_{P \in \{0,1\}^n} T(P)$ where $T : \{0,1\}^* \rightarrow \mathbb{N}$ is the function
defined in Item 1. Prove that NBB grows faster than TOWER, in the sense
that $TOWER(n) = o(NBB(n))$ (i.e., for every $\epsilon > 0$, there exists $n_0$
such that for every $n > n_0$, $TOWER(n) < \epsilon \cdot NBB(n)$).6


11.7 BIBLIOGRAPHICAL NOTES 


As mentioned before, Gödel, Escher, Bach [Hof99] is a highly recommended
book covering Gödel’s Theorem. A classic popular science
book about Fermat’s Last Theorem is [Sin97].

Cantor’s diagonalization techniques are used in the proofs of both
Turing’s and Gödel’s theorems. In a twist
of fate, using techniques originating from the works of Gödel and Turing,
Paul Cohen showed in 1963 that Cantor’s Continuum Hypothesis
is independent of the axioms of set theory, which means that neither
it nor its negation is provable from these axioms and hence in some
sense can be considered as “neither true nor false” (see [Coh08]). The
Continuum Hypothesis is the conjecture that for every infinite subset $S$ of $\mathbb{R}$,
either there is a one-to-one and onto map between $S$ and $\mathbb{N}$ or there
is a one-to-one and onto map between $S$ and $\mathbb{R}$. It was conjectured
by Cantor and listed by Hilbert in 1900 as one of the most important
problems in mathematics. See also the non-conventional survey of
Shelah [She03]. See here for recent progress on a related question.

Thanks to Alex Lombardi for pointing out an embarrassing mistake 
in the description of Fermat’s Last Theorem. (I said that it was open 
for exponent 11 before Wiles’ work.) 
6 You will not need to use very specific properties of
the TOWER function in this exercise. For example,
NBB(n) also grows faster than the Ackermann function.
You might find Aaronson’s blog post on the
same topic to be quite interesting, and relevant to this
book at large. If you like it then you might also enjoy
this piece by Terence Tao.


EFFICIENT ALGORITHMS 


Learning Objectives:


• Describe at a high level some interesting
computational problems.


• The difference between polynomial and
exponential time.


• Examples of techniques for obtaining efficient
algorithms


• Examples of how seemingly small differences
in problems can potentially make huge
differences in their computational complexity.


12 Efficient computation: An informal introduction


“The problem of distinguishing prime numbers from composite and of resolving 
the latter into their prime factors is ... one of the most important and useful 

in arithmetic ... Nevertheless we must confess that all methods ... are either 
restricted to very special cases or are so laborious ... they try the patience of 
even the practiced calculator ... and do not apply at all to larger numbers.”, 
Carl Friedrich Gauss, 1798 


“For practical purposes, the difference between algebraic and exponential order 
is often more crucial than the difference between finite and non-finite.”, Jack 
Edmonds, “Paths, Trees, and Flowers”, 1963 


“What is the most efficient way to sort a million 32-bit integers?”, Eric 
Schmidt to Barack Obama, 2008 


“I think the bubble sort would be the wrong way to go.”, Barack Obama. 


So far we have been concerned with which functions are computable 
and which ones are not. In this chapter we look at the finer question 
of the time that it takes to compute functions, as a function of their input 
length. Time complexity is extremely important to both the theory and 
practice of computing, but in introductory courses, coding interviews, 
and software development, terms such as “O(n) running time” are of- 
ten used in an informal way. People don’t have a precise definition of 
what a linear-time algorithm is, but rather assume that “they'll know 
it when they see it”. In this book we will define running time pre- 
cisely, using the mathematical models of computation we developed 
in the previous chapters. This will allow us to ask (and sometimes 
answer) questions such as: 


• “Is there a function that can be computed in $O(n^2)$ time but not in
$O(n)$ time?”


• “Are there natural problems for which the best algorithm (and not
just the best known) requires $2^{\Omega(n)}$ time?”


Compiled on 12.19.2022 22:58 
While the difference between $O(n)$ and $O(n^2)$ time can be crucial in
practice, in this book we focus on the even bigger difference between
polynomial and exponential running time. As we will see, the difference
between polynomial versus exponential time is typically insensitive to
the choice of the particular computational model: a polynomial-time
algorithm is still polynomial whether you use Turing machines, RAM
machines, or a parallel cluster as your model of computation, and similarly
an exponential-time algorithm will remain exponential in all of
these platforms. One of the interesting phenomena of computing is
that there is often a kind of a “threshold phenomenon” or “zero-one
law” for running time. Many natural problems can either be solved
in polynomial running time with a not-too-large exponent (e.g., something
like $O(n^2)$ or $O(n^3)$), or require exponential (e.g., at least $2^{\Omega(n)}$
or $2^{\Omega(\sqrt{n})}$) time to solve. The reasons for this phenomenon are still not
fully understood, but some light on it is shed by the concept of NP
completeness, which we will see in Chapter 15.

This chapter is merely a tiny sample of the landscape of computa- 
tional problems and efficient algorithms. If you want to explore the 
field of algorithms and data structures more deeply (which I very 
much hope you do!), the bibliographical notes contain references to 
some excellent texts, some of which are available freely on the web. 
12.1 PROBLEMS ON GRAPHS 


In this chapter we discuss several examples of important computa- 
tional problems. Many of the problems will involve graphs. We have 
already encountered graphs before (see Section 1.4.4) but now quickly 
recall the basic notation. A graph G consists of a set of vertices V and 
edges E where each edge is a pair of vertices. We typically denote by 
n the number of vertices (and in fact often consider graphs where the 
set of vertices V equals the set [n] of the integers between 0 and n — 1). 
In a directed graph, an edge is an ordered pair $(u, v)$, which we sometimes
denote as $u \rightarrow v$. In an undirected graph, an edge is an unordered
pair (or simply a set) $\{u, v\}$ which we sometimes denote as $uv$ or
$u \sim v$. An equivalent viewpoint is that an undirected graph corresponds
to a directed graph satisfying the property that whenever the
edge $u \rightarrow v$ is present then so is the edge $v \rightarrow u$. In this chapter we restrict
our attention to graphs that are undirected and simple (i.e., containing
no parallel edges or self-loops). Graphs can be represented either in
the adjacency list or adjacency matrix representation. We can transform
between these two representations using $O(n^2)$ operations, and hence
for our purposes we will mostly consider them as equivalent.
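
As a small illustration of the two representations, here is a Python sketch of the conversions in both directions. It assumes the vertex set is [n] = {0, ..., n-1}, that an adjacency list is a Python list whose u-th entry lists the neighbors of u in increasing order, and that an adjacency matrix is a list of 0/1 rows; the function names are illustrative.

```python
def list_to_matrix(adj_list):
    """Adjacency list -> n x n adjacency matrix, using O(n^2) operations."""
    n = len(adj_list)
    matrix = [[0] * n for _ in range(n)]
    for u, neighbors in enumerate(adj_list):
        for v in neighbors:
            matrix[u][v] = 1
    return matrix

def matrix_to_list(matrix):
    """Adjacency matrix -> adjacency list, scanning each of the n^2 entries once."""
    return [[v for v, entry in enumerate(row) if entry] for row in matrix]
```

For an undirected graph the matrix is symmetric, reflecting the viewpoint that each undirected edge corresponds to a pair of directed edges.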

Graphs are so ubiquitous in computer science and other sciences
because they can be used to model a great deal of the data that we
encounter. These are not just the “obvious” data such as the road
network (which can be thought of as a graph whose vertices are
locations with edges corresponding to road segments), or the web
(which can be thought of as a graph whose vertices are web pages
with edges corresponding to links), or social networks (which can
be thought of as a graph whose vertices are people and the edges
correspond to the friend relation). Graphs can also denote correlations in
data (e.g., graph of observations of features with edges corresponding
to features that tend to appear together), causal relations (e.g., gene
regulatory networks, where a gene is connected to gene products it
derives), or the state space of a system (e.g., graph of configurations
of a physical system, with edges corresponding to states that can be
reached from one another in one step).


12.1.1 Finding the shortest path in a graph 

The shortest path problem is the task of finding, given a graph $G =
(V, E)$ and two vertices $s, t \in V$, the length of the shortest path
between $s$ and $t$ (if such a path exists). That is, we want to find the
smallest number $k$ such that there are vertices $v_0, v_1, \ldots, v_k$ with $v_0 = s$,
$v_k = t$ and for every $i \in \{0, \ldots, k-1\}$ an edge between $v_i$ and $v_{i+1}$. Formally,
we define $MINPATH : \{0,1\}^* \rightarrow \{0,1\}^*$ to be the function that
on input a triple $(G, s, t)$ (represented as a string) outputs the number
$k$ which is the length of the shortest path in $G$ between $s$ and $t$ or a
string representing no path if no such path exists. (In practice people
often want to also find the actual path and not just its length; it turns
out that the algorithms to compute the length of the path often yield
the actual path itself as a byproduct, and so everything we say about
the task of computing the length also applies to the task of finding the
path.)

If each vertex has at least two neighbors then there can be an exponential
number of paths from $s$ to $t$, but fortunately we do not have to
enumerate them all to find the shortest path. We can find the shortest
path using a breadth first search (BFS), enumerating $s$’s neighbors,
and then neighbors’ neighbors, etc., in order. If we maintain
the neighbors in a list we can perform a BFS in $O(n^2)$ time, while using
a queue we can do this in $O(m)$ time, where $m$ is the number of edges.1
Dijkstra’s algorithm is a
well-known generalization of BFS to weighted graphs. More formally,
the algorithm for computing the function MINPATH is described in
Algorithm 12.2.


Figure 12.1: Some examples of graphs found on the 
Internet. 


1 A queue is a data structure for storing a list of el- 
ements in “First In First Out (FIFO)” order. Each 
“pop” operation removes an element from the queue 
in the order that they were “pushed” into it; see the 
Wikipedia page. 
Since we only add to the queue vertices $w$ with $D[w] = \infty$ (and
then immediately set $D[w]$ to an actual number), we never push to
the queue a vertex more than once, and hence the algorithm makes at
most $n$ “push” and “pop” operations. For each vertex $v$, the number
of times we run the inner loop is equal to the degree of $v$ and hence
the total running time is proportional to the sum of all degrees which
equals twice the number $m$ of edges. Algorithm 12.2 returns the correct
answer since the vertices are added to the queue in the order of
their distance from $s$, and hence we will reach $t$ after we have explored
all the vertices that are closer to $s$ than $t$.
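
The queue-based breadth first search analyzed above can be sketched in Python as follows. This is a minimal sketch and not necessarily identical to Algorithm 12.2: it assumes the graph is given as an adjacency list over the vertices 0, ..., n-1, it uses the array name D from the analysis for the distance labels, and it returns None to stand for the “no path” output.

```python
from collections import deque
from math import inf

def min_path(adj, s, t):
    """Length of a shortest path from s to t in an unweighted graph, or None."""
    D = [inf] * len(adj)        # D[w] = infinity means w has not been discovered yet
    D[s] = 0
    queue = deque([s])          # FIFO queue of discovered but unexplored vertices
    while queue:
        v = queue.popleft()     # "pop": vertices leave in order of distance from s
        if v == t:
            return D[v]
        for w in adj[v]:        # inner loop runs deg(v) times
            if D[w] == inf:     # each vertex is pushed at most once
                D[w] = D[v] + 1
                queue.append(w) # "push"
    return None                 # t is not reachable from s
```

Recording, for each newly discovered vertex w, the vertex v from which it was reached would also recover an actual shortest path and not just its length.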
12.1.2 Finding the longest path in a graph 


The longest path problem is the task of finding the length of the longest
simple (i.e., non-intersecting) path between a given pair of vertices
$s$ and $t$ in a given graph $G$. If the graph is a road network, then the
longest path might seem less motivated than the shortest path (unless
you are the kind of person that always prefers the “scenic route”).
But graphs can be and are used to model a variety of phenomena, and in
many such cases finding the longest path (and some of its variants) 
can be very useful. In particular, finding the longest path is a gener- 
alization of the famous Hamiltonian path problem which asks for a 
maximally long simple path (i.e., path that visits all n vertices once) 
between s and t, as well as the notorious traveling salesman problem 
(TSP) of finding (in a weighted graph) a path visiting all vertices of 
cost at most w. TSP is a classical optimization problem, with appli- 
cations ranging from planning and logistics to DNA sequencing and 
astronomy. 

Surprisingly, while we can find the shortest path in O(m) time,
there is no known algorithm for the longest path problem that
significantly improves on the trivial “exhaustive search” or “brute force”
algorithm that enumerates all the exponentially many possibilities
for such paths. Specifically, the best known algorithms for the longest
path problem take O(c^n) time for some constant c > 1. (At the
moment the best record is c ≈ 1.65 or so; even obtaining an O(2^n) time
bound is not that simple, see Exercise 12.1.)
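
To make the “exhaustive search” baseline concrete, here is a minimal Python sketch that tries every simple path starting at s. The adjacency-list input `adj` and the name `longest_path` are assumptions made for this example; the running time can be exponential in n, so this is only usable on very small graphs.

```python
def longest_path(adj, s, t):
    """Number of edges on a longest simple s-t path, or -1 if there is none.
    Exhaustive search: extends every simple path that starts at s."""
    best = -1
    visited = [False] * len(adj)

    def extend(v, length):
        nonlocal best
        if v == t:
            best = max(best, length)
            return               # a simple s-t path ends when it reaches t
        visited[v] = True
        for w in adj[v]:
            if not visited[w]:
                extend(w, length + 1)
        visited[v] = False       # backtrack so other paths may use v

    extend(s, 0)
    return best
```

For example, on the cycle 0-1-2-3-0 the longest simple path from 0 to 1 goes the “long way around” and has 3 edges, while the shortest path has a single edge.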


Figure 12.2: A knight's tour can be thought of as a
maximally long path on the graph corresponding to
a chessboard where we put an edge between any two
squares that can be reached by one step via a legal
knight move.


12.1.3 Finding the minimum cut in a graph

Given a graph G = (V, E), a cut of G is a subset S ⊆ V such that S
is neither empty nor all of V. The edges cut by S are those edges
where one of their endpoints is in S and the other is in S̄ = V \ S. We
denote this set of edges by E(S, S̄). If s, t ∈ V are a pair of vertices
then an s, t cut is a cut such that s ∈ S and t ∈ S̄ (see Fig. 12.3).
The minimum s, t cut problem is the task of finding, given s and t, the
minimum number k such that there is an s, t cut cutting k edges (the
problem is also sometimes phrased as finding the set that achieves
this minimum; it turns out that algorithms to compute the number
often yield the set as well). Formally, we define MINCUT : {0,1}* →
{0,1}* to be the function that on input a string representing a triple
(G = (V, E), s, t) of a graph and two vertices, outputs the minimum
number k such that there exists a set S ⊆ V with s ∈ S, t ∉ S and
|E(S, S̄)| = k.

Computing minimum s, t cuts is useful in many applications since 
minimum cuts often correspond to bottlenecks. For example, in a com- 
munication or railroad network the minimum cut between s and t 
corresponds to the smallest number of edges that, if dropped, will 
disconnect s from t. (This was actually the original motivation for this 
problem; see Section 12.6.) Similar applications arise in scheduling 
and planning. In the setting of image segmentation, one can define a 
graph whose vertices are pixels and whose edges correspond to neigh- 
boring pixels of distinct colors. If we want to separate the foreground 
from the background then we can pick (or guess) a foreground pixel s 
and background pixel t and ask for a minimum cut between them. 

The naive algorithm for computing MINCUT will check all 2^n possible
subsets of an n-vertex graph, but it turns out we can do much
better than that. As we’ve seen in this book time and again, there is 
more than one algorithm to compute the same function, and some 
of those algorithms might be more efficient than others. Luckily the 
minimum cut problem is one of those cases. In particular, as we will 
see in the next section, there are algorithms that compute MINCUT in 
time which is polynomial in the number of vertices. 
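
For comparison with the efficient algorithms, the naive approach can be sketched in a few lines of Python. The representation (vertices 0..n-1, `edges` as a list of pairs) and the name `mincut_brute_force` are assumptions made for this example, and we assume s ≠ t.

```python
from itertools import combinations

def mincut_brute_force(n, edges, s, t):
    """Minimum number of edges cut by a set S with s in S and t not in S.
    Tries all 2^(n-2) candidate sets, so it only works for tiny graphs."""
    others = [v for v in range(n) if v not in (s, t)]
    best = len(edges)            # no cut can cut more than all the edges
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            S = {s, *extra}
            # an edge is cut when exactly one of its endpoints is in S
            cut = sum((u in S) != (v in S) for u, v in edges)
            best = min(best, cut)
    return best
```

On two triangles joined by a single “bridge” edge, with s in one triangle and t in the other, the function returns 1: the bridge is the bottleneck.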


12.1.4 Min-Cut Max-Flow and Linear programming 
We can obtain a polynomial-time algorithm for computing MINCUT 
using the Max-Flow Min-Cut Theorem. This theorem says that the 
minimum cut between s and t equals the maximum amount of flow 
we can send from s to t, if every edge has unit capacity. Specifically, 
imagine that every edge of the graph corresponded to a pipe that 
could carry one unit of fluid per one unit of time (say 1 liter of water 
per second). The maximum s, t flow is the maximum units of water 
that we could transfer from s to t over these pipes. If there is an s, t 
cut of k edges, then the maximum flow is at most k. The reason is 
that such a cut S acts as a “bottleneck” since at most k units can flow 
from S to its complement at any given unit of time. This means that 
the maximum s, t flow is always at most the value of the minimum 
s,t cut. The surprising and non-trivial content of the Max-Flow Min- 
Cut Theorem is that the maximum flow is also at least the value of the 
minimum cut, and hence computing the cut is the same as computing 
the flow. 

Figure 12.3: A cut in a graph G = (V, E) is simply a
subset S of its vertices. The edges that are cut by S
are all those whose one endpoint is in S and the other
one is in S̄ = V \ S. The cut edges are colored red in
this figure.

The Max-Flow Min-Cut Theorem reduces the task of computing a
minimum cut to the task of computing a maximum flow. However, this
still does not show how to compute such a flow. The Ford-Fulkerson
Algorithm is a direct way to compute a flow using incremental
improvements. But computing flows in polynomial time is also a special
case of a much more general tool known as linear programming.
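
The “incremental improvements” idea can be sketched as follows for the unit-capacity, undirected setting used in this section: repeatedly find a path from s to t along which one more unit can be pushed, and stop when no such path exists. This is an illustration in the spirit of Ford-Fulkerson rather than the algorithm as originally stated; choosing the paths by breadth first search, as well as the names `max_flow_unit` and `cap`, are choices made for this example.

```python
from collections import deque

def max_flow_unit(n, edges, s, t):
    """Maximum s-t flow in an undirected graph where every edge has capacity 1."""
    # cap[u][v] is the residual capacity from u to v; an undirected edge
    # can carry one unit in either direction.
    cap = [dict() for _ in range(n)]
    for u, v in edges:
        cap[u][v] = cap[u].get(v, 0) + 1
        cap[v][u] = cap[v].get(u, 0) + 1
    flow = 0
    while True:
        # Search for a path from s to t that uses only edges with spare capacity.
        parent = {s: None}
        Q = deque([s])
        while Q and t not in parent:
            u = Q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    Q.append(v)
        if t not in parent:
            return flow          # no improving path is left
        # Push one unit along the path, updating the residual capacities.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1
```

By the Max-Flow Min-Cut Theorem the value returned also equals the minimum s, t cut, so on small graphs it can be compared against exhaustive search over all cuts.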

A flow on a graph G of m edges can be modeled as a vector x ∈ R^m
where for every edge e, x_e corresponds to the amount of water per
time-unit that flows on e. We think of an edge e as an ordered pair
(u, v) (we can choose the order arbitrarily) and let x_e be the amount
of flow that goes from u to v. (If the flow is in the other direction then
we make x_e negative.) Since every edge has capacity one, we know
that −1 ≤ x_e ≤ 1 for every edge e. A valid flow has the property that
the amount of water leaving the source s is the same as the amount 
entering the sink t, and that for every other vertex v, the amount of 
water entering and leaving v is the same. 

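Mathematically, we can write these conditions as follows:

$$
\begin{aligned}
\sum_{e \ni s} x_e + \sum_{e \ni t} x_e &= 0 \\
\sum_{e \ni v} x_e &= 0 \quad \forall v \in V \setminus \{s,t\} \\
-1 \leq x_e \leq 1 & \quad \forall e \in E
\end{aligned}
\tag{12.1}
$$

where for every vertex v, summing over e ∋ v means summing over all
the edges that touch v.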

The maximum flow problem can be thought of as the task of
maximizing Σ_{e ∋ s} x_e over all the vectors x ∈ R^m that satisfy the above
conditions (12.1). Maximizing a linear function ℓ(x) over the set of
x ∈ R^m that satisfy certain linear equalities and inequalities is known
as linear programming. Luckily, there are polynomial-time algorithms
for solving linear programming, and hence we can solve the maximum
flow (and so, equivalently, minimum cut) problem in polynomial
time. In fact, there are much better algorithms for maximum-flow/
minimum-cut, even for weighted directed graphs, with currently
the record standing at O(min{m^{10/7}, m√n}) time.


Solved Exercise 12.1 — Global minimum cut. Given a graph G = (V, E), 
define the global minimum cut of G to be the minimum over all S ⊆ V
with S ≠ ∅ and S ≠ V of the number of edges cut by S. Prove that
there is a polynomial-time algorithm to compute the global minimum 
cut of a graph. 


Solution: 
By the above we know that there is a polynomial-time algorithm
A that on input (G, s, t) finds the minimum s, t cut in the graph
G. Using A, we can obtain an algorithm B that on input a graph G
computes the global minimum cut as follows:

1. For every distinct pair s, t ∈ V, Algorithm B sets k_{s,t} ← A(G, s, t).

2. B returns the minimum of k_{s,t} over all distinct pairs s, t.

The running time of B will be O(n^2) times the running time of A
and hence polynomial time. Every value k_{s,t} is the number of edges
cut by some cut, and so is at least the global minimum. Moreover, if
the global minimum cut is S, then when B reaches an iteration with
s ∈ S and t ∉ S, the value k_{s,t} will be at most the number of edges
cut by S. Hence the value output by B will be the value of the global
minimum cut.

The above is our first example of a reduction in the context of 
polynomial-time algorithms. Namely, we reduced the task of com- 
puting the global minimum cut to the task of computing minimum 
s,t cuts. 
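
The reduction can be written down directly in code. In the sketch below, `global_min_cut` plays the role of algorithm B and takes the s, t cut algorithm A as a parameter. The stand-in `min_st_cut` is a brute-force routine included only to keep the example self-contained; it is an assumption of this example, and any correct polynomial-time A could be plugged in instead.

```python
from itertools import combinations

def min_st_cut(n, edges, s, t):
    """Stand-in for algorithm A (exhaustive search, for illustration only)."""
    others = [v for v in range(n) if v not in (s, t)]
    best = len(edges)
    for r in range(len(others) + 1):
        for extra in combinations(others, r):
            S = {s, *extra}
            best = min(best, sum((u in S) != (v in S) for u, v in edges))
    return best

def global_min_cut(n, edges, A=min_st_cut):
    """Algorithm B: the minimum of A(G, s, t) over all pairs of distinct vertices."""
    return min(A(n, edges, s, t) for s, t in combinations(range(n), 2))
```

Since the graph is undirected, the minimum s, t cut and the minimum t, s cut have the same value, so it suffices to go over unordered pairs; either way B makes O(n^2) calls to A.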


12.1.5 Finding the maximum cut in a graph 

The maximum cut problem is the task of finding, given an input graph 
G = (V, E), the subset S ⊆ V that maximizes the number of edges
cut by S. (We can also define an s, t-cut variant of the maximum cut 
like we did for minimum cut; the two variants have similar complexity 
but the global maximum cut is more common in the literature.) Like 
its cousin the minimum cut problem, the maximum cut problem is 
also very well motivated. For example, maximum cut arises in VLSI 
design, and also has some surprising relation to analyzing the Ising 
model in statistical physics. 

Surprisingly, while (as we’ve seen) there is a polynomial-time
algorithm for the minimum cut problem, there is no known algorithm
solving maximum cut much faster than the trivial “brute force”
algorithm that tries all 2^n possibilities for the set S.
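
The trivial algorithm is easy to state in code. In this minimal sketch (the encoding of S as a bitmask and the name `maxcut_brute_force` are choices made for this example), we simply evaluate every one of the 2^n subsets:

```python
def maxcut_brute_force(n, edges):
    """Maximum number of edges cut by any subset S of the vertices 0..n-1."""
    best = 0
    for mask in range(2 ** n):   # vertex v is in S iff bit v of mask is 1
        cut = sum(((mask >> u) & 1) != ((mask >> v) & 1) for u, v in edges)
        best = max(best, cut)
    return best
```

For instance, a triangle has maximum cut 2 (no set can separate all three pairs), while a cycle on four vertices has maximum cut 4, obtained by taking S to be alternate vertices.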


12.1.6 A note on convexity 

There is an underlying reason for the sometimes radical difference
between the difficulty of maximizing and minimizing a function over
a domain. If D ⊆ R^n, then a function f : D → R is convex if for every
x, y ∈ D and p ∈ [0,1], f(px + (1 − p)y) ≤ p f(x) + (1 − p) f(y).
That is, f applied to the p-weighted midpoint between x and y is
at most the p-weighted average value of f. If D itself is convex
(which means that if x, y are in D then so is the line segment between
them), then this means that if x is a local minimum of f then it is also
a global minimum. The reason is that if f(y) < f(x) then every point
z = px + (1 − p)y with p < 1 on the line segment between x and y will
satisfy f(z) ≤ p f(x) + (1 − p) f(y) < f(x). Since such points z can be
taken arbitrarily close to x, in particular x cannot
be a local minimum. Intuitively, local minima of functions are much
easier to find than global ones: after all, any “local search” algorithm
that keeps finding a nearby point on which the value is lower, will
eventually arrive at a local minimum. One example of such a local
search algorithm is gradient descent which takes a sequence of small
steps, each one in the direction that would reduce the value by the
largest amount based on the current derivative.

Figure 12.4: In a convex function f (left figure), for
every x and y and p ∈ [0, 1] it holds that f(px + (1 −
p)y) ≤ p·f(x) + (1 − p)·f(y). In particular this means
that every local minimum of f is also a global minimum.
In contrast in a non-convex function there can be many
local minima.

Figure 12.5: In the high dimensional case, if f is a
convex function (left figure) the global minimum
is the only local minimum, and we can find it by
a local-search algorithm which can be thought of
as dropping a marble and letting it “slide down”
until it reaches the global minimum. In contrast, a
non-convex function (right figure) might have an
exponential number of local minima in which any
local-search algorithm could get stuck.

Indeed, under certain technical conditions, we can often efficiently 
find the minimum of convex functions over a convex domain, and 
this is the reason why problems such as minimum cut and shortest 
path are easy to solve. On the other hand, maximizing a convex func- 
tion over a convex domain (or equivalently, minimizing a concave 
function) can often be a hard computational task. A linear function 
is both convex and concave, which is the reason that both the maxi- 
mization and minimization problems for linear functions can be done 
efficiently. 
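
As a small illustration of local search on a convex function, here is a minimal gradient descent sketch. The function f, the step size and the number of iterations are all choices made for this example; in general the step size has to be tuned to the function at hand.

```python
def gradient_descent(grad, x0, step=0.1, iters=1000):
    """Repeatedly take a small step against the gradient, starting from x0."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

# f(x, y) = (x - 1)^2 + 2 (y + 3)^2 is convex; its only local minimum,
# which is therefore the global minimum, is at (1, -3).
def grad_f(v):
    return [2 * (v[0] - 1), 4 * (v[1] + 3)]

x_min = gradient_descent(grad_f, [0.0, 0.0])
```

Because f is convex, the starting point does not matter here: the iterates “slide down” to the same minimum from anywhere, which is exactly what fails for non-convex functions.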

The minimum cut problem is not a priori a convex minimization 
task, because the set of potential cuts is discrete and not continuous. 
However, it turns out that we can embed it in a continuous and con- 
vex set via the (linear) maximum flow problem. The “max flow min 
cut” theorem ensures that this embedding is “tight” in the sense that 
the minimum “fractional cut” that we obtain through the maximum- 
flow linear program will be the same as the true minimum cut. Un- 
fortunately, we don’t know of such a tight embedding in the setting of 
the maximum cut problem. 

Convexity arises time and again in the context of efficient computation.
For example, one of the basic tasks in machine learning is empirical
risk minimization. This is the task of finding a classifier for a given
set of training examples. That is, the input is a list of labeled examples
(x_0, y_0), …, (x_{m−1}, y_{m−1}) where each x_i ∈ {0,1}^n and y_i ∈ {0,1},
and the goal is to find a classifier h : {0,1}^n → {0,1} (or sometimes
h : {0,1}^n → R) that minimizes the number of errors. More generally,
we want to find h that minimizes

$$\sum_{i=0}^{m-1} L(y_i, h(x_i))$$

where L is some loss function measuring how far the predicted label
h(x_i) is from the true label y_i. When L is the square loss function
L(y, y') = (y − y')^2 and h is a linear function, empirical risk
minimization corresponds to the well-known convex minimization task of
linear regression. In other cases, when the task is non-convex, there can
be many global or local minima. That said, even if we don’t find the
global (or even a local) minimum, this continuous embedding can still
help us. In particular, when running a local improvement algorithm
such as Gradient Descent, we might still find a function h that is
“useful” in the sense of having a small error on future examples from the
same distribution.


12.2 BEYOND GRAPHS 


Not all computational problems arise from graphs. We now list some 
other examples of computational problems that are of great interest. 


12.2.1 SAT 

A propositional formula φ involves n variables x_1, …, x_n and the logical
operators AND (∧), OR (∨), and NOT (¬, also denoted by an overline). We say
that such a formula is in conjunctive normal form (CNF for short) if it is
an AND of ORs of variables or their negations (we call a term of the
form x_i or x̄_i a literal). For example, this is a CNF formula

$$(x_7 \vee \overline{x}_{22} \vee x_{15}) \wedge (x_{37} \vee x_{22}) \wedge (x_{55} \vee \overline{x}_7)$$

The satisfiability problem is the task of determining, given a CNF
formula φ, whether or not there exists a satisfying assignment for φ. A
satisfying assignment for φ is a string x ∈ {0,1}^n such that φ
evaluates to True if we assign its variables the values of x. The SAT problem
might seem like an abstract question of interest only in logic, but in fact
SAT is of huge interest in industrial optimization, with applications
including manufacturing planning, circuit synthesis, software verification,
air-traffic control, scheduling sports tournaments, and more.
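
To fix ideas, here is a minimal Python sketch of evaluating a CNF formula and of the brute-force algorithm for SAT. The encoding is an assumption made for this example: a formula is a list of clauses, each clause is a list of non-zero integers, and the integer i stands for the variable x_i while −i stands for its negation.

```python
from itertools import product

def evaluate(cnf, x):
    """x maps a variable index to a bool. True iff every clause has a true literal."""
    return all(any(x[abs(l)] == (l > 0) for l in clause) for clause in cnf)

def brute_force_sat(cnf):
    """Return a satisfying assignment (as a dict) or None if there is none."""
    variables = sorted({abs(l) for clause in cnf for l in clause})
    for bits in product([False, True], repeat=len(variables)):
        x = dict(zip(variables, bits))
        if evaluate(cnf, x):
            return x
    return None
```

Evaluating a given assignment takes time linear in the size of the formula; it is the search over all 2^n assignments that makes this approach infeasible beyond a few dozen variables.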


2SAT. We say that a formula is a k-CNF if it is an AND of ORs where
each OR involves exactly k literals. The k-SAT problem is the
restriction of the satisfiability problem for the case that the input formula is
a k-CNF. In particular, the 2SAT problem is to find out, given a 2-CNF
formula φ, whether there is an assignment x ∈ {0,1}^n that satisfies
φ, in the sense that it makes it evaluate to 1 or “True”. The trivial,
brute-force, algorithm for 2SAT will enumerate all the 2^n assignments
x ∈ {0,1}^n but fortunately we can do much better. The key is that
we can think of every constraint of the form ℓ_i ∨ ℓ_j (where ℓ_i, ℓ_j are
literals, corresponding to variables or their negations) as an implication
ℓ̄_i ⇒ ℓ_j, since it corresponds to the constraint that if the literal ℓ̄_i
is true then it must be the case that ℓ_j is true as well. Hence we can
think of φ as a directed graph between the 2n literals, with an edge
from ℓ_i to ℓ_j corresponding to an implication from the former to the
latter. It can be shown that φ is unsatisfiable if and only if there is a
variable x_i such that there is a directed path from x_i to x̄_i as well as
a directed path from x̄_i to x_i (see Exercise 12.2). This reduces 2SAT
to the (efficiently solvable) problem of determining connectivity in
directed graphs.
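
The reduction to directed connectivity can be sketched as follows. The encoding (the integer i for x_i and −i for its negation) and the name `two_sat` are assumptions made for this example. Note that each clause a ∨ b contributes two edges, one for the implication “NOT a implies b” and one for the equivalent implication “NOT b implies a”.

```python
def two_sat(n, clauses):
    """Decide satisfiability of a 2-CNF over the variables x_1..x_n.
    clauses is a list of pairs of non-zero ints; -i stands for NOT x_i."""
    graph = {l: set() for i in range(1, n + 1) for l in (i, -i)}
    for a, b in clauses:         # (a OR b): NOT a => b, and NOT b => a
        graph[-a].add(b)
        graph[-b].add(a)

    def reachable(src, dst):
        """Depth first search for a directed path from src to dst."""
        seen, stack = {src}, [src]
        while stack:
            u = stack.pop()
            if u == dst:
                return True
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return False

    # Unsatisfiable iff some x_i and its negation lie on a common directed cycle.
    return not any(reachable(i, -i) and reachable(-i, i) for i in range(1, n + 1))
```

This sketch runs 2n graph searches and so takes polynomial (though not linear) time; it only decides satisfiability and does not output an assignment.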

3SAT. The 3SAT problem is the task of determining satisfiability
for 3CNFs. One might think that changing from two to three would
not make that much of a difference for complexity. One would be
wrong. Despite much effort, we do not know of a significantly better
than brute force algorithm for 3SAT (the best known algorithms take
roughly 1.3^n steps).

Interestingly, a similar issue arises time and again in computation, 
where the difference between two and three often corresponds to 
the difference between tractable and intractable. We do not fully un- 
derstand the reasons for this phenomenon, though the notion of NP 
completeness we will see later does offer a partial explanation. It may 
be related to the fact that optimizing a polynomial often amounts to
solving equations on its derivative. The derivative of a quadratic polynomial is
linear, while the derivative of a cubic is quadratic, and, as we will see, 
the difference between solving linear and quadratic equations can be 
quite profound. 


12.2.2 Solving linear equations 

One of the most useful problems that people have been solving time 
and again is solving n linear equations in n variables. That is, solve 
equations of the form 


$$
\begin{aligned}
a_{0,0}x_0 + a_{0,1}x_1 + \cdots + a_{0,n-1}x_{n-1} &= b_0 \\
a_{1,0}x_0 + a_{1,1}x_1 + \cdots + a_{1,n-1}x_{n-1} &= b_1 \\
&\;\;\vdots \\
a_{n-1,0}x_0 + a_{n-1,1}x_1 + \cdots + a_{n-1,n-1}x_{n-1} &= b_{n-1}
\end{aligned}
$$

where {a_{i,j}}_{i,j ∈ [n]} and {b_i}_{i ∈ [n]} are real (or rational) numbers. More
compactly, we can write this as the equations Ax = b where A is an
n × n matrix, and we think of x, b as column vectors in R^n.

The standard Gaussian elimination algorithm can be used to solve 
such equations in polynomial time (i.e., determine if they have a so- 
lution, and if so, to find it). As we discussed above, if we are willing 
to allow some loss in precision, we even have algorithms that handle 
linear inequalities, also known as linear programming. In contrast, if 
we insist on integer solutions, the task of solving for linear equalities
or inequalities is known as integer programming, and the best known
algorithms are exponential time in the worst case.
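
Here is a minimal sketch of Gaussian elimination for the square case Ax = b. It works over the rationals using Python's `fractions.Fraction` so that no precision is lost. As a simplification made for this example, the function `solve_linear` returns None whenever A is singular, even though a singular system may still have (infinitely many) solutions.

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve Ax = b exactly for a square matrix A.
    Returns the unique solution as a list of Fractions, or None if A is singular."""
    n = len(A)
    # Augmented matrix [A | b] over the rationals, to avoid rounding errors.
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None          # no unique solution
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]   # make the pivot equal to 1
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]
```

The three nested loops perform O(n^3) arithmetic operations. (Bounding the size of the rational numbers that arise along the way requires a separate argument, which we do not give here.)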


12.2.3 Solving quadratic equations


Suppose that we want to solve not just linear equations but also equations
involving quadratic terms of the form a_{i,j,k} x_j x_k. That is, suppose that
we are given a set of quadratic polynomials p_1, …, p_m and consider
the equations {p_i(x) = 0}. To avoid issues with bit representations,
we will always assume that the equations contain the constraints
{x_i^2 − x_i = 0}_{i ∈ [n]}. Since only 0 and 1 satisfy the equation a^2 − a = 0,
this assumption means that we can restrict attention to solutions in
{0,1}^n. Solving quadratic equations in several variables is a classical
and extremely well motivated problem. This is the generalization of
the classical case of single-variable quadratic equations that
generations of high school students grapple with. It also generalizes the
quadratic assignment problem, introduced in the 1950’s as a way to
optimize assignment of economic activities. Once again, we do not
know a much better algorithm for this problem than the one that
enumerates over all the 2^n possibilities.


12.3 MORE ADVANCED EXAMPLES 


We now list a few more examples of interesting problems that are a 
little more advanced but are of significant interest in areas such as 
physics, economics, number theory, and cryptography. 


12.3.1 Determinant of a matrix 
The determinant of an n × n matrix A, denoted by det(A), is an
extremely important quantity in linear algebra. For example, it is known
that det(A) ≠ 0 if and only if A is non-singular, which means that it
has an inverse A^{-1}, and hence we can always uniquely solve equations
of the form Ax = b where x and b are n-dimensional vectors. More
generally, the determinant can be thought of as a quantitative measure
of the extent to which A is far from being singular. If the rows of A are
“almost” linearly dependent (for example, if the third row is very close
to being a linear combination of the first two rows) then the
determinant will be small, while if they are far from it (for example, if they
are orthogonal to one another) then the determinant will be large. In
particular, for every matrix A, the absolute value of the determinant
of A is at most the product of the norms (i.e., square root of sum of
squares of entries) of the rows, with equality if and only if the rows
are orthogonal to one another.

The determinant can be defined in several ways. One way to define 
the determinant of an n x n matrix A is: 


$$\det(A) = \sum_{\pi \in S_n} \mathrm{sign}(\pi) \prod_{i \in [n]} A_{i,\pi(i)} \tag{12.2}$$

where S_n is the set of all permutations from [n] to [n] and the sign of
a permutation π is equal to −1 raised to the power of the number of
inversions in π (pairs i, j such that i > j but π(i) < π(j)).

This definition suggests that computing det(A) might require
summing over |S_n| terms which would take exponential time since
|S_n| = n! > 2^n (for n ≥ 4). However, there are other ways to compute the
determinant. For example, it is known that det is the only function that
satisfies the following conditions:

1. det(AB) = det(A)det(B) for every pair of square matrices A, B.

2. For every n × n triangular matrix T with diagonal entries
d_0, …, d_{n−1}, det(T) = ∏_{i=0}^{n−1} d_i. In particular det(I) = 1 where I is
the identity matrix. (A triangular matrix is one in which either all
entries below the diagonal, or all entries above the diagonal, are
zero.)

3. det(S) = −1 where S is a “swap matrix” that corresponds to
swapping two rows or two columns of I. That is, there are two
coordinates a, b such that for every i, j,

$$S_{i,j} = \begin{cases} 1 & i = j,\; i \notin \{a,b\} \\ 1 & \{i,j\} = \{a,b\} \\ 0 & \text{otherwise} \end{cases}$$


Using these rules and the Gaussian elimination algorithm, it is
possible to tell whether A is singular or not, and in the latter case,
decompose A as a product of a polynomial number of swap matrices
and triangular matrices. (Indeed one can verify that the row
operations in Gaussian elimination correspond to either multiplying by a
swap matrix or by a triangular matrix.) Hence we can compute the
determinant for an n × n matrix using a polynomial number of arithmetic
operations.
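
The contrast between the two ways of computing the determinant can be seen in the following sketch. The first function sums over all n! permutations as in the definition; the second uses row operations, tracking how each one changes the determinant (a swap flips the sign, and subtracting a multiple of one row from another changes nothing). Both function names, and the use of exact rational arithmetic, are choices made for this example.

```python
from fractions import Fraction
from itertools import permutations

def det_by_permutations(A):
    """Determinant via the sum over all n! permutations (exponential time)."""
    n = len(A)
    total = 0
    for pi in permutations(range(n)):
        inversions = sum(pi[i] > pi[j] for i in range(n) for j in range(i + 1, n))
        term = (-1) ** inversions
        for i in range(n):
            term *= A[i][pi[i]]
        total += term
    return total

def det_by_elimination(A):
    """Determinant via Gaussian elimination, using O(n^3) arithmetic operations."""
    n = len(A)
    M = [[Fraction(v) for v in row] for row in A]
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)   # A is singular
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det           # a row swap flips the sign
        det *= M[col][col]       # diagonal entry of the final triangular matrix
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return det
```

The two functions agree on every input, but already for n around 12 the first one has to go over hundreds of millions of permutations while the second finishes instantly.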


12.3.2 Permanent of a matrix 
Given an n x n matrix A, the permanent of A is defined as 


$$\mathrm{perm}(A) = \sum_{\pi \in S_n} \prod_{i \in [n]} A_{i,\pi(i)} \tag{12.3}$$

That is, perm(A) is defined analogously to the determinant in (12.2)
except that we drop the term sign(π). The permanent of a matrix is a
natural quantity, and has been studied in several contexts including
combinatorics and graph theory. It also arises in physics where it can
be used to describe the quantum state of multiple boson particles.


Permanent modulo 2. If the entries of A are integers, then we can
define the Boolean function perm_2 which outputs on input a matrix A
the result of the permanent of A modulo 2. It turns out that we can
compute perm_2(A) in polynomial time. The key is that modulo 2, −x
and +x are the same quantity and hence, since the only difference
between (12.2) and (12.3) is that some terms are multiplied by −1,
det(A) mod 2 = perm(A) mod 2 for every A.
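
This observation translates into a short program. In the sketch below, `permanent` follows the definition and takes time proportional to n!, while `perm_mod_2` computes the same value modulo 2 by running Gaussian elimination on the matrix with all entries reduced modulo 2. Both names and the implementation details are assumptions made for this example.

```python
from itertools import permutations

def permanent(A):
    """Permanent via the sum over all n! permutations (exponential time)."""
    n = len(A)
    total = 0
    for pi in permutations(range(n)):
        term = 1
        for i in range(n):
            term *= A[i][pi[i]]
        total += term
    return total

def perm_mod_2(A):
    """perm(A) mod 2, computed as det(A) mod 2 by elimination over {0, 1}."""
    n = len(A)
    M = [[v % 2 for v in row] for row in A]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] == 1), None)
        if pivot is None:
            return 0             # singular modulo 2
        M[col], M[pivot] = M[pivot], M[col]   # the sign of a swap vanishes mod 2
        for r in range(col + 1, n):
            if M[r][col] == 1:
                M[r] = [(a + b) % 2 for a, b in zip(M[r], M[col])]
    return 1                     # all diagonal entries equal 1
```

On small matrices one can check directly that `permanent(A) % 2` and `perm_mod_2(A)` coincide.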


Permanent modulo 3. Emboldened by our good fortune above, we 
might hope to be able to compute the permanent modulo any prime p 
and perhaps in full generality. Alas, we have no such luck. In a similar 
“two to three” type of a phenomenon, we do not know of a much 
better than brute force algorithm to even compute the permanent 
modulo 3. 


12.3.3 Finding a zero-sum equilibrium 

A zero sum game is a game between two players where the payoff for 
one is the same as the penalty for the other. That is, whatever the first 
player gains, the second player loses. As much as we want to avoid 
them, zero sum games do arise in life, and the one good thing about 
them is that at least we can compute the optimal strategy. 

A zero sum game can be specified by an n × n matrix A, where if
player 1 chooses action i and player 2 chooses action j then player 1
gets A_{i,j} and player 2 loses the same amount. The famous Min Max
Theorem by John von Neumann states that if we allow probabilistic or
“mixed” strategies (where a player does not choose a single action but
rather a distribution over actions) then it does not matter who plays
first and the end result will be the same. Mathematically the min max
theorem is that if we let Δ_n be the set of probability distributions over
[n] (i.e., non-negative column vectors in R^n whose entries sum to 1)
then

$$\max_{p \in \Delta_n} \min_{q \in \Delta_n} p^\top A q = \min_{q \in \Delta_n} \max_{p \in \Delta_n} p^\top A q \tag{12.4}$$


The min-max theorem turns out to be a corollary of linear pro- 
gramming duality, and indeed the value of (12.4) can be computed 
efficiently by a linear program. 


12.3.4 Finding a Nash equilibrium 

Fortunately, not all real-world games are zero sum, and we do have 
more general games, where the payoff of one player does not neces- 
sarily equal the loss of the other. John Nash won the Nobel prize for 
showing that there is a notion of equilibrium for such games as well. 
In many economic texts it is taken as an article of faith that when 
actual agents are involved in such a game then they reach a Nash 
equilibrium. However, unlike zero sum games, we do not know of 
an efficient algorithm for finding a Nash equilibrium given the de- 
scription of a general (non-zero-sum) game. In particular this means 
that, despite economists’ intuitions, there are games for which natural 
strategies will take an exponential number of steps to converge to an 
equilibrium. 


12.3.5 Primality testing 

Another classical computational problem, that has been of interest 
since the ancient Greeks, is to determine whether a given number 
N is prime or composite. Clearly we can do so by trying to divide it 
with all the numbers in 2,..., N — 1, but this would take at least N 
steps which is exponential in its bit complexity n = log N. We can 
reduce these N steps to √N by observing that if N is a composite of
the form N = PQ then either P or Q is at most √N. But this is
still quite terrible. If N is a 1024 bit integer, √N is about 2^512, and so
running this algorithm on such an input would take much more than 
the lifetime of the universe. 

Luckily, it turns out we can do radically better. In the 1970’s, Ra- 
bin and Miller gave probabilistic algorithms to determine whether a 
given number N is prime or composite in time poly(n) for n = log N. 
We will discuss the probabilistic model of computation later in this 
course. In 2002, Agrawal, Kayal, and Saxena found a deterministic 
poly(n) time algorithm for this problem. This is surely a development 
that mathematicians from Archimedes till Gauss would have found 
exciting. 
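
To illustrate the gap, here is a sketch of both approaches: trial division up to √N, and the Miller-Rabin probabilistic test in its usual textbook form. The function names and the choice of 20 random rounds are assumptions made for this example; when N is composite, each round fails to detect this with probability at most 1/4, so the chance of a wrong answer is at most 4^(-20).

```python
import random

def is_prime_trial_division(N):
    """Deterministic, but takes about sqrt(N) steps."""
    if N < 2:
        return False
    d = 2
    while d * d <= N:
        if N % d == 0:
            return False
        d += 1
    return True

def is_probable_prime(N, rounds=20):
    """Miller-Rabin test: always accepts primes, rejects composites w.h.p."""
    if N < 2:
        return False
    for p in (2, 3, 5, 7):
        if N % p == 0:
            return N == p
    r, d = 0, N - 1              # write N - 1 = 2^r * d with d odd
    while d % 2 == 0:
        r, d = r + 1, d // 2
    for _ in range(rounds):
        a = random.randrange(2, N - 1)
        x = pow(a, d, N)         # fast modular exponentiation
        if x in (1, N - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, N)
            if x == N - 1:
                break
        else:
            return False         # a witnesses that N is composite
    return True
```

Each round costs a modular exponentiation, which takes poly(n) time for an n-bit number, so the second function comfortably handles numbers with hundreds of digits, far beyond the reach of trial division.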


12.3.6 Integer factoring

Given that we can efficiently determine whether a number N is prime
or composite, we could expect that in the latter case we could also
efficiently find the factorization of N. Alas, no such algorithm is known.
In a surprising and exciting turn of events, the conjectured non-existence
of such an algorithm has been used as a basis for encryption, and indeed it
underlies much of the security of the world wide web. We will return to
the factoring problem later in this course. We remark that we do know
much better than brute force algorithms for this problem. While the
brute force algorithms would require 2^{Ω(n)} time to factor an n-bit
integer, there are known algorithms running in time roughly 2^{O(√n)} and
also algorithms that are widely believed (though not fully rigorously
analyzed) to run in time roughly 2^{O(n^{1/3})}. (By “roughly” we mean that
we neglect factors that are polylogarithmic in n.)


12.4 OUR CURRENT KNOWLEDGE 


The difference between an exponential and polynomial time algo- 
rithm might seem merely “quantitative” but it is in fact extremely 
significant. As we've already seen, the brute force exponential time 
algorithm runs out of steam very very fast, and as Edmonds says, in 
practice there might not be much difference between a problem where 
the best algorithm is exponential and a problem that is not solvable 
at all. Thus the efficient algorithms we mentioned above are widely 
used and power many computer science applications. Moreover, a 
polynomial-time algorithm often arises out of significant insight to 
the problem at hand, whether it is the “max-flow min-cut” result, the 
solvability of the determinant, or the group theoretic structure that 
enables primality testing. Such insight can be useful regardless of its 
computational implications. 

At the moment we do not know whether the “hard” problems are 
truly hard, or whether it is merely because we haven't yet found the 
right algorithms for them. However, we will now see that there are 
problems that do inherently require exponential time. We just don’t 
know if any of the examples above fall into that category. 


Chapter Recap

• There are many natural problems that have
polynomial-time algorithms, and other natural
problems that we’d love to solve, but for which the
best known algorithms are exponential.

• Often a polynomial time algorithm relies on
discovering some hidden structure in the problem, or
finding a surprising equivalent formulation for it.

• There are many interesting problems where there
is an exponential gap between the best known
algorithm and the best algorithm that we can rule out.
Closing this gap is one of the main open questions
of theoretical computer science.


Figure 12.6: The current computational status of
several interesting problems. For all of them we either
know a polynomial-time algorithm or the known
algorithms require at least 2^{n^c} time for some c > 0. In
fact for all except the factoring problem, we either
know an O(n^3) time algorithm or the best known
algorithms require at least 2^{Ω(n)} time where n is a
natural parameter such that there is a brute force
algorithm taking roughly 2^n or n! time. Whether this
“cliff” between the easy and hard problems is a real
phenomenon or a reflection of our ignorance is still an
open question.


12.5 EXERCISES 


Exercise 12.1 — exponential time algorithm for longest path. The naive algo- 
rithm for computing the longest path in a given graph could take 
more than n! steps. Give a poly(n)·2^n time algorithm for the longest
path problem in n vertex graphs.²


Exercise 12.2 — 2SAT algorithm. For every 2CNF φ, define the graph G_φ
on 2n vertices corresponding to the literals x_1, …, x_n, x̄_1, …, x̄_n, such
that there is an edge from ℓ_i to ℓ_j iff the constraint ℓ̄_i ∨ ℓ_j is in φ.
Prove that φ is unsatisfiable if and only if there is some i such that
there is a path from x_i to x̄_i and from x̄_i to x_i in G_φ. Show how to
use this to solve 2SAT in polynomial time.


Exercise 12.3 — Reductions for showing algorithms. The following fact is 
true: there is a polynomial-time algorithm BIP that on input a graph 
G = (V, E) outputs 1 if and only if the graph is bipartite: there is a 
partition of V to disjoint parts S and T such that every edge (u, v) ∈ E
satisfies either u ∈ S and v ∈ T or u ∈ T and v ∈ S. Use this
fact to prove that there is a polynomial-time algorithm to compute
the following function CLIQUEPARTITION that on input a graph
G = (V, E) outputs 1 if and only if there is a partition of the vertices V
of the graph into two parts S and T such that both S and T are cliques:
for every pair of distinct vertices u, v ∈ S, the edge (u, v) is in E and
similarly for every pair of distinct vertices u, v ∈ T, the edge (u, v) is in E.


12.6 BIBLIOGRAPHICAL NOTES 


The classic undergraduate introduction to algorithms text is 
[Cor+09]. Two texts that are less “encyclopedic” are Kleinberg and 
Tardos [KT06], and Dasgupta, Papadimitriou and Vazirani [DPV08]. 
Jeff Erickson’s book is an excellent algorithms text that is freely 
available online. 

The origins of the minimum cut problem date to the Cold War.
Specifically, Ford and Fulkerson discovered their max-flow/min-cut
algorithm in 1955 as a way to find out the minimum amount of train
tracks that would need to be blown up to disconnect Russia from the
rest of Europe. See the survey [Sch05] for more.

Some algorithms for the longest path problem are given in [Wil09;
Bjo14].

* Hint: Use dynamic programming to compute for every s, t ∈ [n] and
S ⊆ [n] the value P(s, t, S) which equals 1 if there is a simple path from
s to t that uses exactly the vertices in S. Do this iteratively for S's of
growing sizes.


12.7 FURTHER EXPLORATIONS 


Some topics related to this chapter that might be accessible to
advanced students include: (to be completed)


13 Modeling running time


Learning Objectives:

• Formally modeling running time, and in particular notions such as
  O(n) or O(n^3) time algorithms.

• The classes P and EXP modelling polynomial and exponential time
  respectively.

• The time hierarchy theorem, that in particular says that for every
  k ≥ 1 there are functions we can compute in O(n^{k+1}) time but can
  not compute in O(n^k) time.

• The class P/poly of non-uniform computation and the result that
  P ⊆ P/poly.


“When the measure of the problem-size is reasonable and when the 
sizes assume values arbitrarily large, an asymptotic estimate of ... the or- 
der of difficulty of [an] algorithm .. is theoretically important. It cannot 
be rigged by making the algorithm artificially difficult for smaller sizes”, 
Jack Edmonds, “Paths, Trees, and Flowers”, 1963 


Max Newman: It is all very well to say that a machine could ... do this or 
that, but ... what about the time it would take to do it? 

Alan Turing: To my mind this time factor is the one question which will 
involve all the real technical difficulty. 

BBC radio panel on “Can automatic Calculating Machines Be Said to 
Think?”, 1952 


In Chapter 12 we saw examples of efficient algorithms, and made 
some claims about their running time, but did not give a mathemati- 
cally precise definition for this concept. We do so in this chapter, using 
the models of Turing machines and RAM machines (or equivalently 
NAND-TM and NAND-RAM) we have seen before. The running 
time of an algorithm is not a fixed number since any non-trivial algo- 
rithm will take longer to run on longer inputs. Thus, what we want 
to measure is the dependence between the number of steps the
algorithm takes and the length of the input. In particular we care about
the distinction between algorithms that take at most polynomial time
(i.e., O(n^c) time for some constant c) and problems for which every
algorithm requires at least exponential time (i.e., Ω(2^{n^c}) for some c). As
mentioned in Edmonds' quote in Chapter 12, the difference between
these two can sometimes be as important as the difference between 
being computable and uncomputable.

To put this in more “mathy” language, in this chapter we define
what it means for a function F : {0,1}* → {0,1}* to be computable
in time T(n) steps, where T is some function mapping the length n
of the input to the number of computation steps allowed. Using this
definition we will do the following (see also Fig. 13.1):


• We define the class P of Boolean functions that can be computed
  in polynomial time and the class EXP of functions that can be
  computed in exponential time. Note that P ⊆ EXP. If we can compute
  a function in polynomial time, we can certainly compute it in
  exponential time.

• We show that the times to compute a function using a Turing
  machine and using a RAM machine (or NAND-RAM program) are
  polynomially related. In particular this means that the classes P and
  EXP are identical regardless of whether they are defined using
  Turing machines or RAM machines / NAND-RAM programs.

• We give an efficient universal NAND-RAM program and use this to
  establish the time hierarchy theorem that in particular implies that P is
  a strict subset of EXP.

• We relate the notions defined here to the non-uniform models of
  Boolean circuits and NAND-CIRC programs defined in Chapter 3.
  We define P/poly to be the class of functions that can be computed
  by a sequence of polynomial-sized circuits. We prove that P ⊆ P/poly
  and that P/poly contains uncomputable functions.

Figure 13.1: Overview of the results of this chapter.


13.1 FORMALLY DEFINING RUNNING TIME 


Our models of computation (Turing machines, NAND-TM and 
NAND-RAM programs and others) all operate by executing a se- 
quence of instructions on an input one step at a time. We can define 
the running time of an algorithm M in one of these models by measur- 
ing the number of steps M takes on input x as a function of the length 
|x| of the input. We start by defining running time with respect to Tur- 
ing machines: 


Definition 13.1 — Running time (Turing Machines). Let T : N → N be some
function mapping natural numbers to natural numbers. We say
that a function F : {0,1}* → {0,1}* is computable in T(n) Turing-
Machine time (TM-time for short) if there exists a Turing machine M
such that for every sufficiently large n and every x ∈ {0,1}^n, when
given input x, the machine M halts after executing at most T(n)
steps and outputs F(x).

We define TIME_TM(T(n)) to be the set of Boolean functions
(functions mapping {0,1}* to {0,1}) that are computable in T(n)
TM time.

The relaxation of considering only “sufficiently large” n's is not
very important but it is convenient since it allows us to avoid dealing
explicitly with uninteresting “edge cases”.

While the notion of being computable within a certain running time
can be defined for every function, the class TIME_TM(T(n)) is a class
of Boolean functions that have a single bit of output. This choice is not
very important, but is made for simplicity and convenience later on.
In fact, every non-Boolean function has a computationally equivalent
Boolean variant, see Exercise 13.3.
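To make Definition 13.1 concrete, the following is a minimal Python sketch (not part of the formal development) that runs a one-tape Turing machine given as a transition table and counts its steps. The helper names run_tm, PARITY and parity_with_steps, as well as the encoding of the transition table as a dictionary, are assumptions made only for this illustration. The example machine computes the parity of its input and halts after exactly n + 1 steps on inputs of length n, so under these assumptions it witnesses that parity is computable in T(n) = n + 1 TM time.

```python
def run_tm(delta, x, start, halt="halt", blank="_"):
    """Run a one-tape Turing machine and return (tape, number of steps).

    delta maps (state, symbol) to (new_state, symbol_to_write, move),
    where move is "L", "R" or "S" (stay).
    """
    tape = dict(enumerate(x))          # cell index -> symbol
    state, head, steps = start, 0, 0
    while state != halt:
        sym = tape.get(head, blank)
        state, write, move = delta[(state, sym)]
        tape[head] = write
        head += {"L": -1, "R": 1, "S": 0}[move]
        steps += 1                     # one transition = one step
    return tape, steps


# Hypothetical example machine: scan right, tracking the parity of the
# 1s seen so far; on the first blank, write the parity bit and halt.
PARITY = {
    ("even", "0"): ("even", "0", "R"),
    ("even", "1"): ("odd", "1", "R"),
    ("odd", "0"): ("odd", "0", "R"),
    ("odd", "1"): ("even", "1", "R"),
    ("even", "_"): ("halt", "0", "S"),
    ("odd", "_"): ("halt", "1", "S"),
}


def parity_with_steps(x):
    """Return (parity bit of x, number of steps the machine took)."""
    tape, steps = run_tm(PARITY, x, start="even")
    return tape[len(x)], steps


print(parity_with_steps("1101"))  # ('1', 5): three 1s, and 4 + 1 steps
```

The step count here depends only on the input length, which is exactly the kind of dependence that T(n) is meant to capture.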


Solved Exercise 13.1 — Example of time bounds. Prove that
TIME_TM(10·n^3) ⊆ TIME_TM(2^n).


Solution:

The proof is illustrated in Fig. 13.2. Suppose that F ∈ TIME_TM(10·n^3)
and hence there exist some number N_0 and a machine M such
that for every n > N_0 and x ∈ {0,1}^n, M(x) outputs F(x) within
at most 10·n^3 steps. Since 10·n^3 = o(2^n), there is some number
N_1 such that for every n > N_1, 10·n^3 < 2^n. Hence for every
n > max{N_0, N_1}, M(x) will output F(x) within at most 2^n steps,
demonstrating that F ∈ TIME_TM(2^n).
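As a numerical sanity check of the step “there is some number N_1”, the short Python sketch below searches for the last n at which 10·n^3 < 2^n fails. The helper name last_violation and the search cutoff of 200 are assumptions chosen for this illustration; a finite search is of course not a proof, but it suggests that N_1 = 15 works.

```python
def last_violation(t_small, t_big, search_up_to=200):
    """Largest n <= search_up_to with t_small(n) >= t_big(n), or None."""
    bad = [n for n in range(search_up_to + 1) if t_small(n) >= t_big(n)]
    return bad[-1] if bad else None


# T(n) = 10 n^3 versus T'(n) = 2^n, as in Solved Exercise 13.1.
N1 = last_violation(lambda n: 10 * n**3, lambda n: 2**n)
print(N1)  # 15, since 10 * 15**3 = 33750 >= 2**15 = 32768
```

For n = 16 we already have 10·16^3 = 40960 < 65536 = 2^16, and the gap only widens from there.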


13.1.1 Polynomial and Exponential Time 
Unlike the notion of computability, the exact running time can be a 
function of the model we use. However, it turns out that if we only 
care about “coarse enough” resolution (as will most often be the case) 
then the choice of the model, whether Turing machines, RAM ma- 
chines, NAND-TM/NAND-RAM programs, or C/Python programs, 
does not matter. This is known as the extended Church-Turing Thesis. 
Specifically we will mostly care about the difference between polyno- 
mial and exponential time. 

The two main time complexity classes we will be interested in are 
the following: 


• Polynomial time: A function F : {0,1}* → {0,1} is computable in
  polynomial time if it is in the class P = ∪_{c∈{1,2,3,...}} TIME_TM(n^c). That
  is, F ∈ P if there is an algorithm to compute F that runs in time at
  most polynomial (i.e., at most n^c for some constant c) in the length
  of the input.

• Exponential time: A function F : {0,1}* → {0,1} is computable in
  exponential time if it is in the class EXP = ∪_{c∈{1,2,3,...}} TIME_TM(2^{n^c}).
  That is, F ∈ EXP if there is an algorithm to compute F that runs in
  time at most exponential (i.e., at most 2^{n^c} for some constant c) in the
  length of the input.

Figure 13.2: Comparing T(n) = 10n^3 with T'(n) = 2^n (on the right
figure the Y axis is in log scale). Since for every large enough n,
T'(n) > T(n), TIME_TM(T(n)) ⊆ TIME_TM(T'(n)).


In other words, these are defined as follows: 


Definition 13.2 — P and EXP. Let F : {0,1}* → {0,1}. We say that
F ∈ P if there is a polynomial p : N → R and a Turing machine
M such that for every x ∈ {0,1}*, when given input x, the Turing
machine halts within at most p(|x|) steps and outputs F(x).

We say that F ∈ EXP if there is a polynomial p : N → R and
a Turing machine M such that for every x ∈ {0,1}*, when given
input x, M halts within at most 2^{p(|x|)} steps and outputs F(x).


Solved Exercise 13.2 — Different definitions of P. Prove that P as defined in
Definition 13.2 is equal to ∪_{c∈{1,2,3,...}} TIME_TM(n^c).


Solution:

To show these two sets are equal we need to show that
P ⊆ ∪_{c∈{1,2,3,...}} TIME_TM(n^c) and ∪_{c∈{1,2,3,...}} TIME_TM(n^c) ⊆ P. We start
with the former inclusion. Suppose that F ∈ P. Then there is some
polynomial p : N → R and a Turing machine M such that M
computes F and M halts on every input x within at most p(|x|)
steps. We can write the polynomial p : N → R in the form
p(n) = Σ_{i=0}^{d} a_i n^i where a_0,...,a_d ∈ R, and we assume that a_d
is non-zero (or otherwise we just let d correspond to the largest
number such that a_d is non-zero). The degree of p is the number d.
Since n^d = o(n^{d+1}), no matter what the coefficient a_d is, for large
enough n, p(n) < n^{d+1} which means that the Turing machine M
will halt on inputs of length n within fewer than n^{d+1} steps, and
hence F ∈ TIME_TM(n^{d+1}) ⊆ ∪_{c∈{1,2,3,...}} TIME_TM(n^c).

For the second inclusion, suppose that F ∈ ∪_{c∈{1,2,3,...}} TIME_TM(n^c).
Then there is some positive c ∈ N such that F ∈ TIME_TM(n^c) which
means that there is a Turing machine M and some number N_0 such
that M computes F and for every n > N_0, M halts on length n
inputs within at most n^c steps. Let T_0 be the maximum number
of steps that M takes on inputs of length at most N_0. Then if we
define the polynomial p(n) = n^c + T_0 then we see that M halts on
every input x within at most p(|x|) steps and hence the existence of
M demonstrates that F ∈ P.


Since exponential time is much larger than polynomial time, P ⊆
EXP. All of the problems we listed in Chapter 12 are in EXP, but as
we've seen, for some of them there are much better algorithms that
demonstrate that they are in fact in the smaller class P.


P                 EXP (but not known to be in P)
---------------   ------------------------------
Shortest path     Longest path
Min cut           Max cut
2SAT              3SAT
Linear eqs        Quad. eqs
Zerosum           Nash
Determinant       Permanent
Primality         Factoring

Table: A table of the examples from Chapter 12. All these problems
are in EXP but only the ones on the left column are currently known to
be in P as well (i.e., they have a polynomial-time algorithm). See also
Fig. 13.3.


13.2 MODELING RUNNING TIME USING RAM MACHINES / NAND-RAM


Turing machines are a clean theoretical model of computation, but 
do not closely correspond to real-world computing architectures. The 
discrepancy between Turing machines and actual computers does 
not matter much when we consider the question of which functions 
are computable, but can make a difference in the context of efficiency.
Even a basic staple of undergraduate algorithms such as “merge sort”
cannot be implemented on a Turing machine in O(n log n) time (see
Section 13.8). RAM machines (or equivalently, NAND-RAM programs)
match more closely actual computing architecture and what we mean
when we say O(n) or O(n log n) algorithms in algorithms courses
or whiteboard coding interviews. We can define running time with
respect to NAND-RAM programs just as we did for Turing machines.

Figure 13.3: Some examples of problems that are known to be in P and
problems that are known to be in EXP but not known whether or not
they are in P. Since both P and EXP are classes of Boolean functions,
in this figure we always refer to the Boolean (i.e., Yes/No) variant of
the problems.


Definition 13.4 — Running time (RAM). Let T : N → N be some
function mapping natural numbers to natural numbers. We say that
a function F : {0,1}* → {0,1}* is computable in T(n) RAM time
(RAM-time for short) if there exists a NAND-RAM program P such
that for every sufficiently large n and every x ∈ {0,1}^n, when given
input x, the program P halts after executing at most T(n) lines and
outputs F(x).

We define TIME_RAM(T(n)) to be the set of Boolean functions
(functions mapping {0,1}* to {0,1}) that are computable in T(n)
RAM time.


Because NAND-RAM programs correspond more closely to our 
natural notions of running time, we will use NAND-RAM as our 
“default” model of running time, and hence use TIME(T(n)) (without 
any subscript) to denote TIME_RAM(T(n)). However, it turns out that
as long as we only care about the difference between exponential and 
polynomial time, this does not make much difference. The reason is 
that Turing machines can simulate NAND-RAM programs with at 
most a polynomial overhead (see also Fig. 13.4): 


Theorem 13.5 — Relating RAM and Turing machines. Let T : N → N be a
function such that T(n) ≥ n for every n and the map n ↦ T(n) can
be computed by a Turing machine in time O(T(n)). Then


TIME_TM(T(n)) ⊆ TIME_RAM(10·T(n)) ⊆ TIME_TM(T(n)^4). (13.1)

For example, by instantiating Theorem 13.5 with T(n) = n^a and
using the fact that 10n^a = o(n^{a+1}), we see that TIME_TM(n^a) ⊆
TIME_RAM(n^{a+1}) ⊆ TIME_TM(n^{4a+4}) which means that (by Solved
Exercise 13.2)


P = ∪_{a=1,2,...} TIME_TM(n^a) = ∪_{a=1,2,...} TIME_RAM(n^a).


That is, we could have equally well defined P as the class of functions
computable by NAND-RAM programs (instead of Turing machines)
that run in time polynomial in the length of the input. Similarly, by
instantiating Theorem 13.5 with T(n) = 2^{n^a} we see that the class EXP
can also be defined as the set of functions computable by NAND-RAM
programs in time at most 2^{p(n)} where p is some polynomial. Similar
equivalence results are known for many models including cellular
automata, C/Python/Javascript programs, parallel computers, and a
great many other models, which justifies the choice of P as capturing
a technology-independent notion of tractability. (See Section 13.3
for more discussion of this issue.) This equivalence between Turing
machines and NAND-RAM (as well as other models) allows us to
pick our favorite model depending on the task at hand (i.e., “have our
cake and eat it too”) even when we study questions of efficiency, as
long as we only care about the gap between polynomial and exponential
time. When we want to design an algorithm, we can use the extra
power and convenience afforded by NAND-RAM. When we want
to analyze a program or prove a negative result, we can restrict our
attention to Turing machines.


The adjective “reasonable” above refers to all scalable computa- 
tional models that have been implemented, with the possible excep- 
tion of quantum computers, see Section 13.3 and Chapter 23. 


Proof Idea:

The direction TIME_TM(T(n)) ⊆ TIME_RAM(10·T(n)) is not hard to
show, since a NAND-RAM program P can simulate a Turing machine
M with constant overhead by storing the transition table of M in
an array (as is done in the proof of Theorem 9.1). Simulating every
step of the Turing machine can be done in a constant number c of
steps of RAM, and it can be shown this constant c is smaller than
10. Thus the heart of the theorem is to prove that TIME_RAM(T(n)) ⊆
TIME_TM(T(n)^4). This proof closely follows the proof of Theorem 8.1,
where we have shown that every function F that is computable by
a NAND-RAM program P is computable by a Turing machine (or
equivalently a NAND-TM program) M. To prove Theorem 13.5, we
follow the exact same proof but just check that the overhead of the
simulation of P by M is polynomial. The proof has many details, but
is not deep. It is therefore much more important that you understand
the statement of this theorem than its proof.

Figure 13.4: The proof of Theorem 13.5 shows that we can simulate T
steps of a Turing machine with T steps of a NAND-RAM program, and
can simulate T steps of a NAND-RAM program with O(T^4) steps of a
Turing machine. Hence TIME_TM(T(n)) ⊆ TIME_RAM(10·T(n)) ⊆
TIME_TM(T(n)^4).

* 


Proof of Theorem 13.5. We only focus on the non-trivial direction
TIME_RAM(T(n)) ⊆ TIME_TM(T(n)^4). Let F ∈ TIME_RAM(T(n)). F can
be computed in time T(n) by some NAND-RAM program P and we
need to show that it can also be computed in time T(n)^4 by a Turing
machine M. This will follow from showing that F can be computed
in time T(n)^4 by a NAND-TM program, since for every NAND-TM
program Q there is a Turing machine M simulating it such that each
iteration of Q corresponds to a single step of M.

As mentioned above, we follow the proof of Theorem 8.1 (simulation
of NAND-RAM programs using NAND-TM programs) and use
the exact same simulation, but with a more careful accounting of the
number of steps that the simulation costs. Recall that the simulation
of NAND-RAM works by “peeling off” features of NAND-RAM one
by one, until we are left with NAND-TM.

We will not provide the full details but will present the main ideas
used in showing that every feature of NAND-RAM can be simulated
by NAND-TM with at most a polynomial overhead:


1. Recall that every NAND-RAM variable or array element can contain
   an integer between 0 and T where T is the number of lines that
   have been executed so far. Therefore if P is a NAND-RAM program
   that computes F in T(n) time, then on inputs of length n, all
   integers used by P are of magnitude at most T(n). This means that
   the largest value i can ever reach is at most T(n) and so each one of
   P's variables can be thought of as an array of at most T(n) indices,
   each of which holds a natural number of magnitude at most T(n).
   We let ℓ = ⌈log T(n)⌉ be the number of bits needed to encode such
   numbers. (We can start off the simulation by computing T(n) and
   ℓ.)

2. We can encode a NAND-RAM array of length ≤ T(n) containing
   numbers in {0, ..., T(n) − 1} as a Boolean (i.e., NAND-TM) array
   of T(n)·ℓ = O(T(n) log T(n)) bits, which we can also think of as
   a two dimensional array as we did in the proof of Theorem 8.1. We
   encode a NAND-RAM scalar containing a number in {0, ..., T(n) −
   1} simply by a shorter NAND-TM array of ℓ bits.

3. We can simulate the two dimensional arrays using one-dimensional
   arrays of length T(n)·ℓ = O(T(n) log T(n)). All the
   arithmetic operations on integers use the grade-school algorithms,
   that take time that is polynomial in the number ℓ of bits of the
   integers, which is poly(log T(n)) in our case. Hence we can simulate
   T(n) steps of NAND-RAM with O(T(n)·poly(log T(n))) steps of a
   model that uses random access memory but only Boolean-valued
   one-dimensional arrays.


4. The most expensive step is to translate from random access memory
   to the sequential memory model of NAND-TM/Turing machines.
   As we did in the proof of Theorem 8.1 (see Section 8.2), we
   can simulate accessing an array Foo at some location encoded in an
   array Bar by:

   a. Copying Bar to some temporary array Temp.

   b. Having an array Index which is initially all zeros except 1 at the
      first location.

   c. Repeating the following until Temp encodes the number 0:
      (Number of repetitions is at most T(n).)

      • Decrease the number encoded by Temp by 1. (Takes number of
        steps polynomial in ℓ = ⌈log T(n)⌉.)

      • Decrease i until it is equal to 0. (Takes O(T(n)) steps.)

      • Scan Index until we reach the point in which it equals 1 and
        then change this 1 to 0 and go one step further and write 1 in
        this location. (Takes O(T(n)) steps.)

   d. When we are done we know that if we scan Index until we reach
      the point in which Index[i] = 1 then i contains the value that
      was encoded by Bar. (Takes O(T(n)) steps.)

The total cost for each such operation is
O(T(n)^2 + T(n)·poly(log T(n))) = O(T(n)^2) steps.

In sum, we simulate a single step of NAND-RAM using
O(T(n)^2·poly(log T(n))) steps of NAND-TM, and hence the total
simulation time is O(T(n)^3·poly(log T(n))) which is smaller than T(n)^4
for sufficiently large n.

∎
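The quadratic cost of a single simulated memory access can be seen in a small experiment. The Python sketch below mimics steps a to d above: it keeps a single head position i, moves a marker through Index one cell at a time, and counts head moves. The function name locate_index and the choice to count only head moves (ignoring the poly(ℓ) cost of decrementing Temp) are assumptions made for this illustration; it is not the NAND-TM program from the proof.

```python
def locate_index(bar_value, length):
    """Reach cell number bar_value using only sequential scans.

    Returns (final head position, number of head moves). Requires
    length > bar_value. The cost of decrementing Temp is not counted.
    """
    index = [0] * length
    index[0] = 1              # step b: marker at the first location
    temp = bar_value          # step a: Temp holds a copy of Bar
    i, moves = 0, 0           # i is the single head position
    while temp != 0:          # step c
        temp -= 1             # decrease the number encoded by Temp
        moves += i            # decrease i until it is equal to 0
        i = 0
        while index[i] != 1:  # scan Index until we find the marker
            i += 1
            moves += 1
        index[i] = 0          # move the marker one cell to the right
        i += 1
        moves += 1
        index[i] = 1
    moves += i                # step d: rewind and scan once more
    i = 0
    while index[i] != 1:
        i += 1
        moves += 1
    return i, moves


print(locate_index(5, 8))  # (5, 35): 5**2 + 2*5 head moves
```

In this sketch reaching location v costs v^2 + 2v head moves, which matches the O(T(n)^2) bound in the text when v can be as large as T(n).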


13.3 EXTENDED CHURCH-TURING THESIS (DISCUSSION) 


Theorem 13.5 shows that the computational models of Turing machines
and RAM machines / NAND-RAM programs are equivalent up to
polynomial factors in the running time. Other examples of polynomially
equivalent models include:

• All standard programming languages, including
  C/Python/JavaScript/Lisp/etc.

• The λ calculus (see also Section 13.8).

• Cellular automata

• Parallel computers

• Biological computing devices such as DNA-based computers.


The Extended Church-Turing Thesis is the statement that this is true
for all physically realizable computing models. In other words, the
extended Church-Turing thesis says that for every scalable computing
device C (which has a finite description but can be in principle used
to run computation on arbitrarily large inputs), there is some constant
a such that for every function F : {0,1}* → {0,1} that C can
compute on n length inputs using an S(n) amount of physical resources,
F is in TIME(S(n)^a). This is a strengthening of the (“plain”)
Church-Turing Thesis, discussed in Section 8.8, which states that the
set of computable functions is the same for all physically realizable
models, but without requiring the overhead in the simulation between
different models to be at most polynomial.

All the current constructions of scalable computational models and 
programming languages conform to the Extended Church-Turing 
Thesis, in the sense that they can be simulated with polynomial over- 
head by Turing machines (and hence also by NAND-TM or NAND- 
RAM programs). Consequently, the classes P and EXP are robust to 
the choice of model, and we can use the programming language of 
our choice, or high level descriptions of an algorithm, to determine 
whether or not a problem is in P. 

Like the Church-Turing thesis itself, the extended Church-Turing 
thesis is in the asymptotic setting and does not directly yield an ex- 
perimentally testable prediction. However, it can be instantiated with 
more concrete bounds on the overhead, yielding experimentally- 
testable predictions such as the Physical Extended Church-Turing Thesis 
we mentioned in Section 5.6. 

In the last hundred+ years of studying and mechanizing computation,
no one has yet constructed a scalable computing device
that violates the extended Church-Turing Thesis. Quantum
computing, if realized, will pose a serious challenge to the
extended Church-Turing Thesis (see Chapter 23). However, even if
the promises of quantum computing are fully realized, the extended
Church-Turing thesis is “morally” correct, in the sense that, while we 
do need to adapt the thesis to account for the possibility of quantum 
computing, its broad outline remains unchanged. We are still able 
to model computation mathematically, we can still treat programs 
as strings and have a universal program, we still have time hierarchy 
and uncomputability results, and there is still no reason to doubt the 
(“plain”) Church-Turing thesis. Moreover, the prospect of quantum 
computing does not seem to make a difference for the time complexity 
of many (though not all!) of the concrete problems that we care about. 
In particular, as far as we know, out of all the example problems men- 
tioned in Chapter 12 the complexity of only one— integer factoring— 
is affected by modifying our model to include quantum computers as 
well.


13.4 EFFICIENT UNIVERSAL MACHINE: A NAND-RAM INTERPRETER IN NAND-RAM


We have seen in Theorem 9.1 the “universal Turing machine”. Examining
that proof, and combining it with Theorem 13.5, we can see that
the program U has a polynomial overhead, in the sense that it can
simulate T steps of a given NAND-TM (or NAND-RAM) program P on
an input x in O(T^4) steps. But in fact, by directly simulating NAND-RAM
programs we can do better with only a constant multiplicative
overhead. That is, there is a universal NAND-RAM program U such that
for every NAND-RAM program P, U simulates T steps of P using
only O(T) steps. (The implicit constant in the O notation can depend
on the program P but does not depend on the length of the input.)


Theorem 13.7 — Efficient universality of NAND-RAM. There exists a
NAND-RAM program U satisfying the following:


1. (U is a universal NAND-RAM program.) For every NAND-RAM
   program P and input x, U(P, x) = P(x) where by U(P, x) we
   denote the output of U on a string encoding the pair (P, x).


2. (U is efficient.) There are some constants a, b such that for every
   NAND-RAM program P, if P halts on input x after at most
   T steps, then U(P, x) halts after at most C·T steps where
   C ≤ a|P|^b.


Figure 13.5: The universal NAND-RAM program U simulates an input
NAND-RAM program P by storing all of P's variables inside a single
array Vars of U. If P has t variables, then the array Vars is divided into
blocks of length t, where the j-th coordinate of the i-th block contains
the i-th element of the j-th array of P. If the j-th variable of P is scalar,
then we just store its value in the zeroth block of Vars.

Proof of Theorem 13.7. To present a universal NAND-RAM program
in full we would need to describe a precise representation scheme,
as well as the full NAND-RAM instructions for the program. While
this can be done, it is more important to focus on the main ideas, and
so we just sketch the proof here. A specification of NAND-RAM is
given in the appendix, and for the purposes of this simulation, we can
simply use the representation of the NAND-RAM code as an ASCII
string.

The program U gets as input a NAND-RAM program P and an
input x and simulates P one step at a time. To do so, U does the
following:


1. U maintains variables program_counter, and number_steps for the 
current line to be executed and the number of steps executed so far. 


2. U initially scans the code of P to find the number t of unique variable
   names that P uses. It will translate each variable name into a
   number between 0 and t − 1 and use an array Program to store P's
   code where for every line ℓ, Program[ℓ] will store the ℓ-th line of P
   where the variable names have been translated to numbers. (More
   concretely, we will use a constant number of arrays to separately
   encode the operation used in this line, and the variable names and
   indices of the operands.)


3. U maintains a single array Vars that contains all the values of P's
   variables. We divide Vars into blocks of length t. If s is a number
   corresponding to an array variable Foo of P, then we store
   Foo[0] in Vars[s], we store Foo[1] in Vars[t + s], Foo[2]
   in Vars[2t + s] and so on and so forth (see Fig. 13.5). Generally,
   if the s-th variable of P is a scalar variable, then its value will be
   stored in location Vars[s]. If it is an array variable then the value of
   its i-th element will be stored in location Vars[t·i + s].


4. To simulate a single step of P, the program U recovers from Pro- 
gram the line corresponding to program_counter and executes it. 
Since NAND-RAM has a constant number of arithmetic operations, 
we can implement the logic of which operation to execute using a 
sequence of a constant number of if-then-else’s. Retrieving from 
Vars the values of the operands of each instruction can be done 
using a constant number of arithmetic operations. 


The setup stages take only a constant (depending on |P| but not 
on the input x) number of steps. Once we are done with the setup, to 
simulate a single step of P, we just need to retrieve the corresponding 
line and do a constant number of “if elses” and accesses to Vars to 
simulate it. Hence the total running time to simulate T steps of the 
program P is at most O(T) when suppressing constants that depend 
on the program P. 
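The memory layout used by U can be illustrated with a toy interpreter. The Python sketch below is only an illustration under stated assumptions: the five-instruction language (set, add, load, store, jlt), the class FlatVars and the function run are hypothetical and are not the NAND-RAM specification from the appendix. What it does share with the proof is the layout of all variables in one flat array, with the i-th element of the s-th variable stored at position t·i + s, and the fact that each simulated step costs a constant number of operations.

```python
class FlatVars:
    """All variables of the simulated program in one list, blocks of length t."""

    def __init__(self, names):
        self.slot = {name: s for s, name in enumerate(names)}  # name -> s
        self.t = len(names)
        self.cells = []

    def _addr(self, name, i=0):
        return self.t * i + self.slot[name]   # Vars[t*i + s]

    def get(self, name, i=0):
        a = self._addr(name, i)
        return self.cells[a] if a < len(self.cells) else 0

    def set(self, name, value, i=0):
        a = self._addr(name, i)
        if a >= len(self.cells):
            self.cells.extend([0] * (a + 1 - len(self.cells)))
        self.cells[a] = value


def run(program, max_steps=10_000):
    """Interpret a toy program; return (memory, number of simulated steps)."""
    names = sorted({a for line in program for a in line[1:] if isinstance(a, str)})
    mem = FlatVars(names)
    pc = steps = 0
    while pc < len(program) and steps < max_steps:
        op, *args = program[pc]
        pc += 1
        steps += 1
        if op == "set":        # dst = constant
            mem.set(args[0], args[1])
        elif op == "add":      # dst = a + b
            mem.set(args[0], mem.get(args[1]) + mem.get(args[2]))
        elif op == "store":    # arr[idx] = src
            mem.set(args[0], mem.get(args[2]), i=mem.get(args[1]))
        elif op == "load":     # dst = arr[idx]
            mem.set(args[0], mem.get(args[1], i=mem.get(args[2])))
        elif op == "jlt":      # if a < b: jump to line target
            if mem.get(args[0]) < mem.get(args[1]):
                pc = args[2]
    return mem, steps


# Hypothetical example program: A[i] = i + i for i = 0, 1, 2, 3.
PROGRAM = [
    ("set", "i", 0),
    ("set", "one", 1),
    ("set", "n", 4),
    ("add", "d", "i", "i"),
    ("store", "A", "i", "d"),
    ("add", "i", "i", "one"),
    ("jlt", "i", "n", 3),
]

mem, steps = run(PROGRAM)
print([mem.get("A", k) for k in range(4)], steps)  # [0, 2, 4, 6] 19
```

Here t = 5 and the variable A gets slot s = 0, so A[3] lives in cell 5·3 + 0 = 15 of the flat list.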


13.4.1 Timed Universal Turing Machine 

One corollary of the efficient universal machine is the following. 
Given any Turing machine M, input x, and “step budget” T, we can 
simulate the execution of M for T steps in time that is polynomial in 


T. Formally, we define a function TIMEDEVAL that takes the three 
parameters M, x, and the time budget, and outputs M(x) if M halts 
within at most T steps, and outputs 0 otherwise. The timed univer- 
sal Turing machine computes TIMEDEVAL in polynomial time (see 
Fig. 13.6). (Since we measure time as a function of the input length, 
we define TIMEDEVAL as taking the input T represented in unary: a 
string of T ones.) 


Theorem 13.8 — Timed Universal Turing Machine. Let TIMEDEVAL :
{0,1}* → {0,1}* be the function defined as


TIMEDEVAL(M, x, 1^T) = M(x)   if M halts within ≤ T steps on x
                       0      otherwise


Then TIMEDEVAL ∈ P.


Proof. We only sketch the proof since the result follows fairly directly
from Theorem 13.5 and Theorem 13.7. By Theorem 13.5 to show that
TIMEDEVAL ∈ P, it suffices to give a polynomial-time NAND-RAM
program to compute TIMEDEVAL.

Such a program can be obtained as follows. Given a Turing machine
M, by Theorem 13.5 we can transform it in time polynomial in
its description into a functionally-equivalent NAND-RAM program
P such that the execution of M on T steps can be simulated by the
execution of P on c·T steps. We can then run the universal NAND-RAM
machine of Theorem 13.7 to simulate P for c·T steps, using
O(T) time, and output 0 if the execution did not halt within this budget.
This shows that TIMEDEVAL can be computed by a NAND-RAM
program in time polynomial in |M| and linear in T, which means
TIMEDEVAL ∈ P.
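The idea of a step budget is easy to express in code. The following Python sketch simulates a Turing machine directly (rather than going through a NAND-RAM translation as in the proof) and gives up once the budget is exhausted. The function name timed_eval, the dictionary encoding of the transition table, and the convention that the output is the non-blank string starting at cell 0 are all assumptions made for this illustration.

```python
def timed_eval(delta, x, budget_unary, start="q0", halt="halt", blank="_"):
    """Return M(x) if M halts within T = len(budget_unary) steps, else "0".

    delta maps (state, symbol) to (new_state, symbol_to_write, move).
    The output is read as the non-blank string starting at cell 0.
    """
    T = len(budget_unary)            # the budget is given in unary: 1^T
    tape = dict(enumerate(x))
    state, head = start, 0
    for _ in range(T):               # never simulate more than T steps
        if state == halt:
            break
        sym = tape.get(head, blank)
        state, write, move = delta[(state, sym)]
        tape[head] = write
        head += {"L": -1, "R": 1, "S": 0}[move]
    if state != halt:
        return "0"                   # did not halt within the budget
    out, i = [], 0
    while tape.get(i, blank) != blank:
        out.append(tape[i])
        i += 1
    return "".join(out)


# Hypothetical example machines: FLIP negates every bit and halts after
# n + 1 steps; LOOP never halts.
FLIP = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
    ("q0", "_"): ("halt", "_", "S"),
}
LOOP = {("q0", s): ("q0", s, "S") for s in "01_"}

print(timed_eval(FLIP, "0110", "1" * 5))     # 1001
print(timed_eval(FLIP, "0110", "1" * 4))     # 0 (budget too small)
print(timed_eval(LOOP, "0110", "1" * 1000))  # 0 (never halts)
```

Because the budget is passed in unary, the loop runs at most as many times as the input is long, which is why the running time is polynomial in the input length.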


13.5 THE TIME HIERARCHY THEOREM 


Some functions are uncomputable, but are there functions that can
be computed, but only at an exorbitant cost? For example, is there a
function that can be computed in time 2^n, but can not be computed in
time 2^{0.9n}? It turns out that the answer is Yes:


Theorem 13.9 — Time Hierarchy Theorem. For every nice function T :
N → N, there is a function F : {0,1}* → {0,1} in TIME(T(n) log n) \
TIME(T(n)).


There is nothing special about log n, and we could have used any 
other efficiently computable function that tends to infinity with n.

Figure 13.6: The timed universal Turing machine takes as input a
Turing machine M, an input x, and a time bound T, and outputs M(x)
if M halts within at most T steps. Theorem 13.8 states that there is such
a machine that runs in time polynomial in T.

Figure 13.7: The Time Hierarchy Theorem (Theorem 13.9) states that
all of these classes are distinct.

Proof Idea:

In the proof of Theorem 9.6 (the uncomputability of the Halting
problem), we have shown that the function HALT cannot be computed
in any finite time. An examination of the proof shows that it
gives something stronger. Namely, the proof shows that if we fix our
computational budget to be T steps, then not only can we not
distinguish between programs that halt and those that do not, but we
cannot even distinguish between programs that halt within at most T'
steps and those that take more than that (where T' is some number
depending on T). Therefore, the proof of Theorem 13.9 follows the
ideas of the uncomputability of the halting problem, but again with a
more careful accounting of the running time.
* 


Proof of Theorem 13.9. Our proof is inspired by the proof of the
uncomputability of the halting problem. Specifically, for every function
T as in the theorem’s statement, we define the Bounded Halting function
HALT_T as follows. The input to HALT_T is a pair (P, x) such that
|P| ≤ log log |x| and P encodes some NAND-RAM program. We
define

$$
HALT_T(P, x) = \begin{cases} 1, & P \text{ halts on } x \text{ within } \leq 100 \cdot T(|P| + |x|) \text{ steps} \\ 0, & \text{otherwise} \end{cases}
$$

(The constant 100 and the function log log n are rather arbitrary, and
are chosen for convenience in this proof.)

Theorem 13.9 is an immediate consequence of the following two 
claims: 

Claim 1: HALT_T ∈ TIME(T(n) · log n)

and

Claim 2: HALT_T ∉ TIME(T(n)).

Please make sure you understand why indeed the theorem follows 
directly from the combination of these two claims. We now turn to 
proving them. 

Proof of claim 1: We can easily check in linear time whether an
input has the form P, x where |P| ≤ log log |x|. Since T(·) is a nice
function, we can evaluate it in O(T(n)) time. Thus, we can compute
HALT_T(P, x) as follows:

1. Compute T_0 = T(|P| + |x|) in O(T_0) steps.

2. Use the universal NAND-RAM program of Theorem 13.7 to simulate
100 · T_0 steps of P on the input x using at most poly(|P|) · T_0 steps.
(Recall that we use poly(ℓ) to denote a quantity that is bounded by
a · ℓ^b for some constants a, b.)

3. If P halts within these 100 · T_0 steps then output 1, else output 0.

The length of the input is n = |P| + |x|. Since |x| ≤ n and
(log log |x|)^b = o(log |x|) for every b, the running time will be
o(T(|P| + |x|) log n) and hence the above algorithm demonstrates that
HALT_T ∈ TIME(T(n) · log n), completing the proof of Claim 1.
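The three steps above can be sketched in Python under simplifying assumptions that are not part of the proof. Here a "program" is modeled as a Python generator function that yields once per step, which stands in for the universal NAND-RAM program of Theorem 13.7. The names `halts_within` and `bounded_halt` are hypothetical, and the description length |P| is ignored, so the budget is 100 · T(|x|).

```python
def halts_within(program, x, budget):
    """Return 1 if program(x) finishes after at most `budget` steps, else 0.

    Assumption: `program` is a generator function and every `yield`
    counts as one step of the simulated machine.
    """
    steps = 0
    for _ in program(x):
        steps += 1
        if steps > budget:
            return 0  # budget exhausted: stop simulating and answer 0
    return 1


def bounded_halt(program, x, T):
    """Sketch of HALT_T: compute T_0, simulate 100*T_0 steps, report the result."""
    T0 = T(len(x))                              # step 1 (ignoring |P|)
    return halts_within(program, x, 100 * T0)   # steps 2 and 3


def count_up(x):
    """Toy program that takes exactly len(x) steps."""
    for _ in range(len(x)):
        yield


def forever(x):
    """Toy program that never halts."""
    while True:
        yield
```

The point of the sketch is that the simulator never runs longer than the budget, even on a program that loops forever, which is what keeps the total cost proportional to T_0.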

Proof of claim 2: This proof is the heart of Theorem 13.9, and is
very reminiscent of the proof that HALT is not computable. Assume,
for the sake of contradiction, that there is some NAND-RAM program
P* that computes HALT_T(P, x) within T(|P| + |x|) steps. We are going
to show a contradiction by creating a program Q* and showing that
under our assumptions, if Q* runs for less than T(n) steps when given
(a padded version of) its own code as input then it actually runs for
more than T(n) steps and vice versa. (It is worth re-reading the last
sentence twice or thrice to make sure you understand this logic. It is
very similar to the direct proof of the uncomputability of the halting
problem where we obtained a contradiction by using an assumed
“halting solver” to construct a program that, given its own code as
input, halts if and only if it does not halt.)

We will define Q* to be the program that on input a string z does
the following:

1. If z does not have the form z = P1^m where P represents a NAND-RAM
program and |P| < 0.1 log log m then return 0. (Recall that
1^m denotes the string of m ones.)

2. Compute b = P*(P, z) (at a cost of at most T(|P| + |z|) steps, under
our assumptions).

3. If b = 1 then Q* goes into an infinite loop, otherwise it halts.


Let ℓ be the length of the description of Q* as a string, and let m be
larger than 2^{2^{1000ℓ}}. We will reach a contradiction by splitting into
cases according to whether HALT_T(Q*, Q*1^m) equals 0 or 1.

On the one hand, if HALT_T(Q*, Q*1^m) = 1, then under our
assumption that P* computes HALT_T, Q* will go into an infinite loop
on input z = Q*1^m, and hence in particular Q* does not halt within
100 · T(|Q*| + m) steps on the input z. But this contradicts our
assumption that HALT_T(Q*, Q*1^m) = 1.

This means that it must hold that HALT_T(Q*, Q*1^m) = 0. But
in this case, since we assume P* computes HALT_T, Q* does not do
anything in phase 3 of its computation, and so the only computation
costs come in phases 1 and 2 of the computation. It is not hard
to verify that phase 1 can be done in linear time, and in fact in less
than 5|z| steps. Phase 2 involves executing P*, which under our
assumption requires T(|Q*| + m) steps. In total we can perform both
phases in less than 10 · T(|Q*| + m) steps, which by definition means
that HALT_T(Q*, Q*1^m) = 1, but this is of course a contradiction. This
completes the proof of Claim 2 and hence of Theorem 13.9.


Solved Exercise 13.3 — P vs EXP. Prove that P ⊊ EXP.

Solution:

We show why this statement follows from the time hierarchy
theorem, but it can be an instructive exercise to prove it directly,
see Remark 13.10. We need to show that there exists F ∈ EXP \ P.
Let T(n) = n^{log n} and T′(n) = n^{log n / 2}. Both are nice functions.
Since T(n)/T′(n) = ω(log n), by Theorem 13.9 there exists some
F in TIME(T(n)) \ TIME(T′(n)). Since for sufficiently large n,
2^n > n^{log n}, F ∈ TIME(2^n) ⊆ EXP. On the other hand, F ∉ P.
Indeed, suppose otherwise that there was a constant c > 0 and
a Turing machine computing F on n-length input in at most n^c
steps for all sufficiently large n. Then since for n large enough
n^c < n^{log n / 2}, it would have followed that F ∈ TIME(n^{log n / 2}),
contradicting our choice of F.
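The solution relies on the chain n^c < n^{log n / 2} < n^{log n} < 2^n holding for large n. As a rough numerical illustration (not a proof), the following Python sketch compares the base-2 logarithms of these four bounds. The function names are hypothetical, and logarithms are taken base 2, which is an assumption about the convention.

```python
import math


def log2_of_bounds(n, c):
    """Base-2 logarithms of the four time bounds, so huge values stay small."""
    lg = math.log2(n)
    return {
        "n^c": c * lg,
        "n^(log n / 2)": (lg / 2) * lg,
        "n^(log n)": lg * lg,
        "2^n": float(n),
    }


def first_power_of_two_where_chain_holds(c):
    """Smallest power of two n with n^c < n^(log n/2) < n^(log n) < 2^n."""
    n = 2
    while True:
        b = log2_of_bounds(n, c)
        if b["n^c"] < b["n^(log n / 2)"] < b["n^(log n)"] < b["2^n"]:
            return n
        n *= 2


print(first_power_of_two_where_chain_holds(3))  # 128
```

For c = 3 the first inequality needs log n > 6, which is why the chain first holds at n = 128. Larger constants c only push this threshold further out.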


The time hierarchy theorem tells us that there are functions we can
compute in O(n^2) time but not O(n), in 2^n time but not 2^{√n}, etc. In
particular there are most definitely functions that we can compute in
time 2^n but not O(n). We have seen that we have no shortage of natural
functions for which the best known algorithm requires roughly 2^n
time, and that many people have invested significant effort in trying
to improve that. However, unlike in the finite vs. infinite case, for all
of the examples above at the moment we do not know how to rule
out even an O(n) time algorithm. We will however see that there is a
single unproven conjecture that would imply such a result for most of
these problems.

The time hierarchy theorem relies on the existence of an efficient 
universal NAND-RAM program, as proven in Theorem 13.7. For 
other models such as Turing machines we have similar time hierarchy 
results showing that there are functions computable in time T(n) and 
not in time T(n)/ f(n) where f(n) corresponds to the overhead in the 
corresponding universal machine. 


Figure 13.8: Some complexity classes and some of the
functions we know (or conjecture) to be contained in
them.


13.6 NON-UNIFORM COMPUTATION


We have now seen two measures of “computation cost” for functions.
In Section 4.6 we defined the complexity of computing finite functions
using circuits / straightline programs. Specifically, for a finite function
g : {0,1}^n → {0,1} and number s ∈ N, g ∈ SIZE_n(s) if there is a circuit
of at most s NAND gates (or equivalently an s-line NAND-CIRC
program) that computes g. To relate this to the classes TIME(T(n))
defined in this chapter we first need to extend the class SIZE_n(s) from
finite functions to functions with unbounded input length.

Definition 13.11 — Non-uniform computation. Let F : {0,1}* → {0,1}
and T : N → N be a nice time bound. For every n ∈ N, define
F↾n : {0,1}^n → {0,1} to be the restriction of F to inputs of size n.
That is, F↾n is the function mapping {0,1}^n to {0,1} such that for
every x ∈ {0,1}^n, F↾n(x) = F(x).

We say that F is non-uniformly computable in at most T(n) size,
denoted by F ∈ SIZE(T), if there exists a sequence (C_0, C_1, C_2, ...)
of NAND circuits such that:

• For every n ∈ N, C_n computes the function F↾n.

• For every sufficiently large n, C_n has at most T(n) gates.

In other words, F ∈ SIZE(T) iff for every n ∈ N, it holds that
F↾n ∈ SIZE_n(T(n)). The non-uniform analog to the class P is the class
P/poly defined as

$$
\mathbf{P_{/poly}} = \bigcup_{c \in \mathbb{N}} SIZE(n^c) \;. \qquad (13.2)
$$


There is a big difference between non-uniform computation and
uniform complexity classes such as TIME(T(n)) or P. The condition
F ∈ P means that there is a single Turing machine M that computes
F on all inputs in polynomial time. The condition F ∈ P/poly only
means that for every input length n there can be a different circuit C_n
that computes F using polynomially many gates on inputs of this
length. As we will see, F ∈ P/poly does not necessarily imply that
F ∈ P. However, the other direction is true:


Theorem 13.12 — Non-uniform computation contains uniform computation.
There is some a ∈ N s.t. for every nice T : N → N and
F : {0,1}* → {0,1},

TIME(T(n)) ⊆ SIZE(T(n)^a) .

In particular, Theorem 13.12 shows that for every c, TIME(n^c) ⊆
SIZE(n^{ca}) and hence P ⊆ P/poly.


Figure 13.9: We can think of an infinite function
F : {0,1}* → {0,1} as a collection of finite functions
F_0, F_1, F_2, ... where F↾n : {0,1}^n → {0,1} is the
restriction of F to inputs of length n. We say F is in
P/poly if for every n, the function F↾n is computable
by a polynomial-size NAND-CIRC program, or
equivalently, a polynomial-sized Boolean circuit.


Proof Idea:

The idea behind the proof is to “unroll the loop”. Specifically, we
will use the programming language variants of non-uniform and
uniform computation: namely NAND-CIRC and NAND-TM. The main
difference between the two is that NAND-TM has loops. However, for
every fixed n, if we know that a NAND-TM program runs in at most
T(n) steps, then we can replace its loop by simply “copying and
pasting” its code T(n) times, similar to how in Python we can replace code
such as

for i in range(4):
    print(i)

with the “loop free” code

print(0)
print(1)
print(2)
print(3)

To make this idea into an actual proof we need to tackle one
technical difficulty, and this is to ensure that the NAND-TM program is
oblivious in the sense that the value of the index variable i in the j-th
iteration of the loop will depend only on j and not on the contents of
the input. We make a digression to do just that in Section 13.6.1 and
then complete the proof of Theorem 13.12.

* 


13.6.1 Oblivious NAND-TM programs 

Our approach for proving Theorem 13.12 involves “unrolling the
loop”. For example, consider the following NAND-TM program to compute the
XOR function on inputs of arbitrary length:

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
MODANDJUMP(X_nonblank[i],X_nonblank[i])

Setting (as an example) n = 3, we can attempt to translate this
NAND-TM program into a NAND-CIRC program for computing
XOR_3 : {0,1}^3 → {0,1} by simply “copying and pasting” the loop
three times (dropping the MODANDJMP line):

temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_0 = NAND(X[0],X[0])
Y_nonblank[0] = NAND(X[0],temp_0)
temp_2 = NAND(X[i],Y[0])
temp_3 = NAND(X[i],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)


However, the above is still not a valid NAND-CIRC program since
it contains references to the special variable i. To make it into a valid
NAND-CIRC program, we replace references to i in the first iteration
with 0, references in the second iteration with 1, and references in the
third iteration with 2. (We also create a variable zero and use it for the
first time any variable is instantiated, as well as remove assignments to
non-output variables that are never used later on.) The resulting
program is a standard “loop free and index free” NAND-CIRC program
that computes XOR_3 (see also Fig. 13.10):

temp_0 = NAND(X[0],X[0])
one = NAND(X[0],temp_0)
zero = NAND(one,one)
temp_2 = NAND(X[0],zero)
temp_3 = NAND(X[0],temp_2)
temp_4 = NAND(zero,temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[1],Y[0])
temp_3 = NAND(X[1],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
temp_2 = NAND(X[2],Y[0])
temp_3 = NAND(X[2],temp_2)
temp_4 = NAND(Y[0],temp_2)
Y[0] = NAND(temp_3,temp_4)
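As a sanity check, the unrolled program can be transcribed line by line into Python and compared with the XOR of three bits on all eight inputs. This is only a sketch: the function name `xor3_unrolled` and the use of the local variable `Y0` for `Y[0]` are choices made here for illustration.

```python
from itertools import product


def NAND(a, b):
    """NAND of two bits given as the integers 0 or 1."""
    return 1 - (a & b)


def xor3_unrolled(X):
    """Line-by-line transcription of the loop-free NAND-CIRC program."""
    temp_0 = NAND(X[0], X[0])
    one = NAND(X[0], temp_0)
    zero = NAND(one, one)
    # first copy of the loop body (i = 0), with Y[0] read as zero
    temp_2 = NAND(X[0], zero)
    temp_3 = NAND(X[0], temp_2)
    temp_4 = NAND(zero, temp_2)
    Y0 = NAND(temp_3, temp_4)
    # second copy (i = 1)
    temp_2 = NAND(X[1], Y0)
    temp_3 = NAND(X[1], temp_2)
    temp_4 = NAND(Y0, temp_2)
    Y0 = NAND(temp_3, temp_4)
    # third copy (i = 2)
    temp_2 = NAND(X[2], Y0)
    temp_3 = NAND(X[2], temp_2)
    temp_4 = NAND(Y0, temp_2)
    Y0 = NAND(temp_3, temp_4)
    return Y0


# Exhaustive check over {0,1}^3.
all_match = all(xor3_unrolled(x) == x[0] ^ x[1] ^ x[2]
                for x in product([0, 1], repeat=3))
```

Each copy of the four-line block computes the XOR of one input bit with the running value of `Y[0]`, so after three copies the output is the XOR of all three bits.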


Key to this transformation was the fact that in our original NAND-TM
program for XOR, regardless of whether the input is 011, 100, or
any other string, the index variable i is guaranteed to equal 0 in the
first iteration, 1 in the second iteration, 2 in the third iteration, and so
on and so forth. The particular sequence 0, 1, 2, ... is immaterial: the
crucial property is that the NAND-TM program for XOR is oblivious
in the sense that the value of the index i in the j-th iteration depends
only on j and does not depend on the particular choice of the input.
Luckily, it is possible to transform every NAND-TM program into
a functionally equivalent oblivious program with at most quadratic
overhead. (Similarly we can transform any Turing machine into a
functionally equivalent oblivious Turing machine, see Exercise 13.6.)

Figure 13.10: A NAND circuit for XOR_3 obtained by
“unrolling the loop” of the NAND-TM program for
computing XOR three times.


Theorem 13.13 — Making NAND-TM oblivious. Let T : N → N be a nice
function and let F ∈ TIME_TM(T(n)). Then there is a NAND-TM
program P that computes F in O(T(n)^2) steps and satisfying the
following. For every n ∈ N there is a sequence i_0, i_1, ..., i_{m−1} such
that for every x ∈ {0,1}^n, if P is executed on input x then in the
j-th iteration the variable i is equal to i_j.

In other words, Theorem 13.13 implies that if we can compute F in
T(n) steps, then we can compute it in O(T(n)^2) steps with a program
P in which the position of i in the j-th iteration depends only on j
and the length of the input, and not on the contents of the input. Such
a program can be easily translated into a NAND-CIRC program of
O(T(n)^2) lines by “unrolling the loop”.


Proof Idea: 

We can translate any NAND-TM program P’ into an oblivious 
program P by making P “sweep” its arrays. That is, the index i in 
P will always move all the way from position 0 to position T(n) — 1 
and back again. We can then simulate the program P’ with at most 
T(n) overhead: if P’ wants to move i left when we are in a rightward 
sweep then we simply wait the at most 2T (n) steps until the next time 
we are back in the same position while sweeping to the left. 

* 


Proof of Theorem 13.13. Let P′ be a NAND-TM program computing F
in T(n) steps. We construct an oblivious NAND-TM program P for
computing F as follows (see also Fig. 13.11).

1. On input x, P will compute T = T(|x|) and set up arrays Atstart
and Atend satisfying Atstart[0] = 1 and Atstart[i] = 0 for i > 0,
and Atend[T − 1] = 1 and Atend[i] = 0 for all i ≠ T − 1. We can do
this because T is a nice function. Note that since this computation
does not depend on x but only on its length, it is oblivious.

2. P will also have a special array Marker initialized to all zeroes.

3. The index variable of P will change direction of movement to
the right whenever Atstart[i] = 1 and to the left whenever
Atend[i] = 1.

4. The program P simulates the execution of P′. However, if the
MODANDJMP instruction in P′ attempts to move to the right when P
is moving left (or vice versa) then P will set Marker[i] to 1 and
enter into a special “waiting mode”. In this mode P will wait until
the next time in which Marker[i] = 1 (at the next sweep), at which
point P zeroes Marker[i] and continues with the simulation. In
the worst case this will take 2T(n) steps (if P has to go all the way
from one end to the other and back again).

5. We also modify P to ensure it ends the computation after
simulating exactly T(n) steps of P′, adding “dummy steps” if P′ ends
early.

We see that P simulates the execution of P′ with an overhead of
O(T(n)) steps of P per one step of P′, hence completing the proof.
∎

Figure 13.11: We simulate a T(n)-time NAND-TM
program P′ with an oblivious NAND-TM program P
by adding special arrays Atstart and Atend to mark
positions 0 and T − 1 respectively. The program P
will simply “sweep” its arrays from right to left and
back again. If the original program P′ would have
moved i in a different direction then we wait O(T)
steps until we reach the same point back again, and so
P runs in O(T(n)^2) time.
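The sweeping simulation can be illustrated with a small Python sketch that tracks only head positions. It makes several assumptions that are not in the proof: the original program P′ is abstracted to a list of moves (+1 for right, −1 for left) that keep its index inside 0..T−1, the function names are hypothetical, and the Atstart, Atend and Marker arrays are replaced by direct comparisons.

```python
def advance(pos, direction, T):
    """One step of the sweep; turn around at the ends (Atstart / Atend)."""
    pos += direction
    if pos == 0:
        direction = +1
    elif pos == T - 1:
        direction = -1
    return pos, direction


def simulate_obliviously(moves, T):
    """Return the list of positions visited by the sweeping program P."""
    if T < 2:
        raise ValueError("this sketch assumes T >= 2")
    pos, direction = 0, +1
    visited = [pos]
    for m in moves:
        target = pos
        if not 0 <= target + m <= T - 1:
            raise ValueError("move would leave the range 0..T-1")
        # Waiting mode: keep sweeping until we are back at the marked
        # position while moving in the direction that P' asked for.
        while (pos, direction) != (target, m):
            pos, direction = advance(pos, direction, T)
            visited.append(pos)
        # Now the sweep direction agrees with P', so simulate its step.
        pos, direction = advance(pos, direction, T)
        visited.append(pos)
    return visited


def sweep(T, length):
    """The first `length` positions of the pure sweep 0,1,...,T-1,T-2,...,0,1,..."""
    pos, direction, out = 0, +1, [0]
    while len(out) < length:
        pos, direction = advance(pos, direction, T)
        out.append(pos)
    return out


trace_a = simulate_obliviously([+1, +1, -1, +1], T=4)
trace_b = simulate_obliviously([+1, -1, +1, -1], T=4)
```

In this sketch the visited positions always form a prefix of the same sweep, whatever the moves are, and each simulated move costs at most 2(T − 1) steps of the sweep. Step 5 of the proof (padding with dummy steps) is what makes the total length independent of the input as well.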


Theorem 13.13 implies Theorem 13.12. Indeed, if P is a k-line
oblivious NAND-TM program computing F in time T(n) then for every n
we can obtain a NAND-CIRC program of (k − 1) · T(n) lines by simply
making T(n) copies of P (dropping the final MODANDJMP line). In the
j-th copy we replace all references of the form Foo[i] to foo_{i_j} where
i_j is the value of i in the j-th iteration.
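The copy-and-substitute step can be sketched as a short text transformation in Python. This is an illustration under stated assumptions rather than the book's compiler: program lines are plain strings of the form `target = NAND(a,b)`, a reference `Foo[i]` is rewritten to `Foo[i_j]` instead of `foo_{i_j}`, unset variables read as 0, and the names `unroll`, `run_straightline` and `XOR_BODY` are hypothetical.

```python
import re

# Loop body of the XOR program, with the final MODANDJUMP line dropped.
XOR_BODY = [
    "temp_0 = NAND(X[0],X[0])",
    "Y_nonblank[0] = NAND(X[0],temp_0)",
    "temp_2 = NAND(X[i],Y[0])",
    "temp_3 = NAND(X[i],temp_2)",
    "temp_4 = NAND(Y[0],temp_2)",
    "Y[0] = NAND(temp_3,temp_4)",
]


def unroll(body_lines, index_sequence):
    """Make one copy of the body per iteration, replacing [i] by its value."""
    out = []
    for idx in index_sequence:
        for line in body_lines:
            out.append(re.sub(r"\[i\]", f"[{idx}]", line))
    return out


LINE = re.compile(r"\s*(\S+)\s*=\s*NAND\(\s*([^,\s]+)\s*,\s*([^)\s]+)\s*\)\s*")


def run_straightline(lines, x):
    """Evaluate a loop-free program on the bit tuple x and return Y[0]."""
    env = {f"X[{k}]": bit for k, bit in enumerate(x)}
    for line in lines:
        match = LINE.fullmatch(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        target, a, b = match.groups()
        # Assumption: variables that were never assigned read as 0.
        env[target] = 1 - (env.get(a, 0) & env.get(b, 0))
    return env.get("Y[0]", 0)


# The XOR program is oblivious with index sequence 0, 1, 2 on inputs of length 3.
xor3_program = unroll(XOR_BODY, [0, 1, 2])
```

With k = 7 lines in the original XOR program (including the jump), the unrolled program has (k − 1) · 3 = 18 lines, matching the count in the text.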


13.6.2 “Unrolling the loop”: algorithmic transformation of Turing Machines 
to circuits 
The proof of Theorem 13.12 is algorithmic, in the sense that the proof
yields a polynomial-time algorithm that given a Turing machine M
and parameters T and n, produces a circuit of O(T^2) gates that agrees
with M on all inputs x ∈ {0,1}^n (as long as M runs for less than T
steps on these inputs). We record this fact in the following theorem, since
it will be useful for us later on:


Theorem 13.14 — Turing-machine to circuit compiler. There is an algorithm
UNROLL such that for every Turing machine M and numbers n, T,
UNROLL(M, 1^T, 1^n) runs for poly(|M|, T, n) steps and outputs a
NAND circuit C with n inputs, O(T^2) gates, and one output, such
that

$$
C(x) = \begin{cases} y & M \text{ halts in } \leq T \text{ steps and outputs } y \\ 0 & \text{otherwise} \end{cases}
$$


Proof. We only sketch the proof since it follows by directly
translating the proof of Theorem 13.12 into an algorithm together with the
simulation of Turing machines by NAND-TM programs (see also
Fig. 13.13). Specifically, UNROLL does the following:

1. Transform the Turing machine M into an equivalent NAND-TM
program P.

2. Transform the NAND-TM program P into an equivalent oblivious
program P′ following the proof of Theorem 13.13. The program P′
takes T′ = O(T^2) steps to simulate T steps of P.

3. “Unroll the loop” of P′ by obtaining a NAND-CIRC program of
O(T′) lines (or equivalently a NAND circuit with O(T^2) gates)
corresponding to the execution of T′ iterations of P′.

Figure 13.12: The function UNROLL takes as input a
Turing machine M, an input length parameter n, a
step budget parameter T, and outputs a circuit C of
size poly(T) that takes n bits of inputs and outputs
M(x) if M halts on x within at most T steps.




Figure 13.13: We can transform a Turing machine M,
input length parameter n, and time bound T into an
O(T^2)-sized NAND circuit that agrees with M on
all inputs x ∈ {0,1}^n on which M halts in at most
T steps. The transformation is obtained by first using
the equivalence of Turing machines and NAND-TM
programs P, then turning P into an equivalent
oblivious NAND-TM program P′ via Theorem 13.13,
then “unrolling” O(T^2) iterations of the loop of
P′ to obtain an O(T^2) line NAND-CIRC program
that agrees with P′ on length n inputs, and finally
translating this program into an equivalent circuit.


Solved Exercise 13.4 — Alternative characterization of P. Prove that for every
F : {0,1}* → {0,1}, F ∈ P if and only if there is a polynomial-time
Turing machine M such that for every n ∈ N, M(1^n) outputs a
description of an n-input circuit C_n that computes the restriction F↾n
of F to inputs in {0,1}^n.

Solution:

We start with the “if” direction. Suppose that there is a polynomial-time
Turing machine M that on input 1^n outputs a circuit C_n that
computes F↾n. Then the following is a polynomial-time Turing
machine M′ to compute F. On input x ∈ {0,1}*, M′ will:

1. Let n = |x| and compute C_n = M(1^n).
2. Return the evaluation of C_n on x.

Since we can evaluate a Boolean circuit on an input in polynomial
time, M′ runs in polynomial time and computes F(x) on
every input x.

For the “only if” direction, if M′ is a Turing machine that computes
F in polynomial time, then (applying the equivalence of
Turing machines and NAND-TM as well as Theorem 13.13) there is
also an oblivious NAND-TM program P that computes F in time
p(n) for some polynomial p. We can now define M to be the Turing
machine that on input 1^n outputs the NAND circuit obtained by
“unrolling the loop” of P for p(n) iterations. The resulting NAND
circuit computes F↾n and has O(p(n)) gates. It can also be
transformed to a Boolean circuit with O(p(n)) AND/OR/NOT gates.


Solved Exercise 13.5 — P/poly characterization by advice. Let F : {0,1}* →
{0,1}. Then F ∈ P/poly if and only if there exists a polynomial p : N →
N, a polynomial-time Turing machine M and a sequence {a_n}_{n∈N} of
strings, such that for every n ∈ N:

• |a_n| ≤ p(n)
• For every x ∈ {0,1}^n, M(a_n, x) = F(x).

Solution:

We only sketch the proof. For the “only if” direction, if F ∈
P/poly then we can use for a_n simply the description of the
corresponding circuit C_n and for M the program that computes in
polynomial time the evaluation of a circuit on its input.

For the “if” direction, we can use the same “unrolling the loop”
technique of Theorem 13.12 to show that if P is a polynomial-time
NAND-TM program, then for every n ∈ N, the map x ↦ P(a_n, x)
can be computed by a polynomial-size NAND-CIRC program Q_n.


13.6.3 Can uniform algorithms simulate non-uniform ones?

Theorem 13.12 shows that every function in TIME(T(n)) is in
SIZE(poly(T(n))). One can ask if there is an inverse relation. Suppose
that F is such that F↾n has a “short” NAND-CIRC program for every
n. Can we say that it must be in TIME(T(n)) for some “small” T? The
answer is an emphatic no. Not only is P/poly not contained in P, in fact
P/poly contains functions that are uncomputable!


Theorem 13.15 — P/poly contains uncomputable functions. There exists an
uncomputable function F : {0,1}* → {0,1} such that F ∈ P/poly.


Proof Idea:

Since P/poly corresponds to non-uniform computation, a function
F is in P/poly if for every n ∈ N, the restriction F↾n to inputs of length
n has a small circuit/program, even if the circuits for different values
of n are completely different from one another. In particular, if F has
the property that for every equal-length inputs x and x′, F(x) =
F(x′) then this means that F↾n is either the constant function zero
or the constant function one for every n ∈ N. Since the constant
function has a (very!) small circuit, such a function F will always
be in P/poly (indeed even in smaller classes). Yet by a reduction from
the Halting problem, we can obtain a function with this property that
is uncomputable.

* 


Proof of Theorem 13.15. Consider the following “unary halting
function” UH : {0,1}* → {0,1} defined as follows. We let S : N → {0,1}*
be the function that on input n ∈ N, outputs the string that
corresponds to the binary representation of the number n without the most
significant 1 digit. Note that S is onto. For every x ∈ {0,1}*, we
define UH(x) = HALTONZERO(S(|x|)). That is, if n is the length of x,
then UH(x) = 1 if and only if the string S(n) encodes a NAND-TM
program that halts on the input 0.

UH is uncomputable, since otherwise we could compute
HALTONZERO by transforming the input program P into the integer
n such that P = S(n) and then running UH(1^n) (i.e., UH on the string
of n ones). On the other hand, for every n, UH↾n(x) is either equal
to 0 for all inputs x or equal to 1 on all inputs x, and hence can be
computed by a NAND-CIRC program of a constant number of lines.

∎
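The map S and the reduction's encoding step are simple enough to write down in Python. The sketch below covers only that bookkeeping, since UH itself is uncomputable. The names `S`, `S_inverse` and `unary_input_for` are hypothetical, and S is taken to be defined for n ≥ 1, which is an assumption about the intended domain.

```python
def S(n):
    """Binary representation of n with the most significant 1 removed."""
    if n < 1:
        raise ValueError("this sketch defines S only for n >= 1")
    return bin(n)[3:]  # bin(n) looks like '0b1...'; drop '0b' and the leading 1


def S_inverse(p):
    """The unique n >= 1 with S(n) == p, where p is a string of 0s and 1s."""
    return int("1" + p, 2)


def unary_input_for(p):
    """The string 1^n on which UH is queried in the reduction, for program string p."""
    return "1" * S_inverse(p)


print([S(n) for n in range(1, 8)])  # ['', '0', '1', '00', '01', '10', '11']
```

Since every binary string p equals S(n) for n = S_inverse(p), the map S is onto, which is the property the proof uses. The length of the unary input is exponential in |p|, which is harmless here because the argument is about computability and not running time.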


The issue here is of course uniformity. For a function F : {0,1}* →
{0,1}, if F is in TIME(T(n)) then we have a single algorithm that
can compute F↾n for every n. On the other hand, F↾n might be in
SIZE(T(n)) for every n using a completely different algorithm for
every input length. For this reason we typically use P/poly not as a model
of efficient computation but rather as a way to model inefficient
computation. For example, in cryptography people often define an
encryption scheme to be secure if breaking it for a key of length n requires
more than a polynomial number of NAND lines. Since P ⊆ P/poly
this in particular precludes a polynomial time algorithm for doing so,
but there are technical reasons why working in a non-uniform model
makes more sense in cryptography. It also allows us to talk about
security in non-asymptotic terms such as a scheme having “128 bits of
security”.

While it can sometimes be a real issue, in many natural settings the
difference between uniform and non-uniform computation does not
seem so important. In particular, in all the examples of problems not
known to be in P we discussed before (longest path, 3SAT, factoring,
etc.), these problems are also not known to be in P/poly. Thus,
for “natural” functions, if you pretend that TIME(T(n)) is roughly the
same as SIZE(T(n)), you will be right more often than wrong.




13.6.4 Uniform vs. Non-uniform computation: A recap 
To summarize, the two models of computation we have described so 
far are: 


• Uniform models: Turing machines, NAND-TM programs, RAM machines,
NAND-RAM programs, C/JavaScript/Python, etc. These models
include loops and unbounded memory, hence a single program
can compute a function with unbounded input length.

• Non-uniform models: Boolean circuits or straightline programs have
no loops and can only compute finite functions. The time to execute
them is exactly the number of lines or gates they contain.

For a function F : {0,1}* → {0,1} and some nice time bound
T : N → N, we know that:

• If F is uniformly computable in time T(n) then there is a sequence
of circuits C_1, C_2, ... where C_n has poly(T(n)) gates and computes
F↾n (i.e., the restriction of F to {0,1}^n) for every n.

• The reverse direction is not necessarily true: there are examples of
functions F : {0,1}* → {0,1} such that F↾n can be computed by
even a constant size circuit but F is uncomputable.

This means that non-uniform complexity is more useful to establish
hardness of a function than its easiness.

Figure 13.14: Relations between P, EXP, and P/poly. It
is known that P ⊆ EXP, P ⊆ P/poly and that P/poly
contains uncomputable functions (which in particular
are outside of EXP). It is not known whether or not
EXP ⊆ P/poly though it is believed that EXP ⊈ P/poly.


13.7 EXERCISES 


Exercise 13.1 — Equivalence of different definitions of P and EXP. Prove
that the classes P and EXP defined in Definition 13.2 are equal to
∪_{c∈{1,2,3,...}} TIME(n^c) and ∪_{c∈{1,2,3,...}} TIME(2^{n^c}) respectively. (If
S_1, S_2, S_3, ... is a collection of sets then the set S = ∪_{c∈{1,2,3,...}} S_c is
the set of all elements e such that there exists some c ∈ {1,2,3,...}
where e ∈ S_c.)

∎


Exercise 13.2 — Robustness to representation. Theorem 13.5 shows that the
classes P and EXP are robust with respect to variations in the choice
of the computational model. This exercise shows that these classes
are also robust with respect to our choice of the representation of the
input.

Specifically, let F be a function mapping graphs to {0,1}, and let
F′, F″ : {0,1}* → {0,1} be the functions defined as follows. For every
x ∈ {0,1}*:

• F′(x) = 1 iff x represents a graph G via the adjacency matrix
representation such that F(G) = 1.

• F″(x) = 1 iff x represents a graph G via the adjacency list
representation such that F(G) = 1.

Prove that F′ ∈ P iff F″ ∈ P.

More generally, for every function F : {0,1}* → {0,1}, the answer
to the question of whether F ∈ P (or whether F ∈ EXP) is unchanged
by switching representations, as long as transforming one
representation to the other can be done in polynomial time (which essentially
holds for all reasonable representations).


Exercise 13.3 — Boolean functions. For every function F : {0,1}* → {0,1}*,
define Bool(F) to be the function mapping {0,1}* to {0,1} such that
on input a (string representation of a) triple (x, i, σ) with x ∈ {0,1}*,
i ∈ N and σ ∈ {0,1},

$$
Bool(F)(x, i, \sigma) = \begin{cases} F(x)_i & \sigma = 0,\ i < |F(x)| \\ 1 & \sigma = 1,\ i < |F(x)| \\ 0 & \text{otherwise} \end{cases}
$$

where F(x)_i is the i-th bit of the string F(x).

Prove that for every F : {0,1}* → {0,1}*, Bool(F) ∈ P if and only
if there is a Turing machine M and a polynomial p : N → N such that
for every x ∈ {0,1}*, on input x, M halts within ≤ p(|x|) steps and
outputs F(x).


Exercise 13.4 — Composition of polynomial time. Say that a (possibly
non-Boolean) function F : {0,1}* → {0,1}* is computable in polynomial time
if there is a Turing machine M and a polynomial p : N → N such that
for every x ∈ {0,1}*, on input x, M halts within ≤ p(|x|) steps and
outputs F(x). Prove that for every pair of functions F, G : {0,1}* →
{0,1}* computable in polynomial time, their composition F ∘ G, which is
the function H s.t. H(x) = F(G(x)), is also computable in polynomial
time.


Exercise 13.5 — Non-composition of exponential time. Say that a (possibly
non-Boolean) function F : {0,1}* → {0,1}* is computable in exponential
time if there is a Turing machine M and a polynomial p : N → N such
that for every x ∈ {0,1}*, on input x, M halts within ≤ 2^{p(|x|)} steps
and outputs F(x). Prove that there is some F, G : {0,1}* → {0,1}*
s.t. both F and G are computable in exponential time, but F ∘ G is not
computable in exponential time.

∎


Exercise 13.6 — Oblivious Turing Machines. We say that a Turing machine M
is oblivious if there is some function T : N × N → Z such that for every
input x of length n, and t ∈ N it holds that:

• If M takes more than t steps to halt on the input x, then in the
t-th step M’s head will be in the position T(n, t). (Note that this
position depends only on the length of x and not its contents.)

• If M halts before the t-th step then T(n, t) = −1.

Prove that if F ∈ P then there exists an oblivious Turing machine M
that computes F in polynomial time. See footnote for hint.^1


Exercise 13.7 Let $EDGE : \{0,1\}^* \rightarrow \{0,1\}$ be the function such that on
input a string representing a triple $(L, i, j)$, where $L$ is the adjacency
list representation of an $n$ vertex graph $G$, and $i$ and $j$ are numbers in
$[n]$, $EDGE(L,i,j) = 1$ if the edge $\{i,j\}$ is present in the graph. $EDGE$
outputs 0 on all other inputs.

1. Prove that $EDGE \in$ P.

2. Let $PLANARMATRIX : \{0,1\}^* \rightarrow \{0,1\}$ be the function that
on input an adjacency matrix $A$ outputs 1 if and only if the graph
represented by $A$ is planar (that is, can be drawn on the plane without
edges crossing one another). For this question, you can use
without proof the fact that $PLANARMATRIX \in$ P. Prove that
$PLANARLIST \in$ P where $PLANARLIST : \{0,1\}^* \rightarrow \{0,1\}$ is the
function that on input an adjacency list $L$ outputs 1 if and only if $L$
represents a planar graph.


Exercise 13.8 — Evaluate NAND circuits. Let $NANDEVAL : \{0,1\}^* \rightarrow \{0,1\}$
be the function such that for every string representing a pair
$(Q, x)$ where $Q$ is an $n$-input 1-output NAND-CIRC (not NAND-TM!)
program and $x \in \{0,1\}^n$, $NANDEVAL(Q,x) = Q(x)$. On all other
inputs $NANDEVAL$ outputs 0.

Prove that $NANDEVAL \in$ P.

1 Hint: This is the Turing machine analog of Theorem 13.13. We replace one
step of the original TM $M'$ computing $F$ with a “sweep” of the oblivious TM $M$
in which it goes $T$ steps to the right and then $T$ steps to the left.

Exercise 13.9 — Find hard function. Let $NANDHARD : \{0,1\}^* \rightarrow \{0,1\}$
be the function such that on input a string representing a pair $(f, s)$
where

• $f \in \{0,1\}^{2^n}$ for some $n \in \mathbb{N}$
• $s \in \mathbb{N}$

$NANDHARD(f,s) = 1$ if there is no NAND-CIRC program $Q$
of at most $s$ lines that computes the function $F : \{0,1\}^n \rightarrow \{0,1\}$
whose truth table is the string $f$. That is, $NANDHARD(f,s) = 1$
if for every NAND-CIRC program $Q$ of at most $s$ lines, there exists
some $x \in \{0,1\}^n$ such that $Q(x) \neq f_x$, where $f_x$ denotes the $x$-th
coordinate of $f$, using the binary representation to identify $\{0,1\}^n$
with the numbers $\{0,\ldots,2^n - 1\}$.

1. Prove that $NANDHARD \in$ EXP.

2. (Challenge) Prove that there is an algorithm $FINDHARD$ such
that if $n$ is sufficiently large, then $FINDHARD(1^n)$ runs in time
$2^{2^{O(n)}}$ and outputs a string $f \in \{0,1\}^{2^n}$ that is the truth table of
a function that is not contained in $SIZE(2^n/(1000n))$. (In other
words, if $f$ is the string output by $FINDHARD(1^n)$ then if we let
$F : \{0,1\}^n \rightarrow \{0,1\}$ be the function such that $F(x)$ outputs the $x$-th
coordinate of $f$, then $F \notin SIZE(2^n/(1000n))$.) See footnote 2 for a hint.


Exercise 13.10 Suppose that you are in charge of scheduling courses in 
computer science in University X. In University X, computer science 
students wake up late, and have to work on their startups in the af- 
ternoon, and take long weekends with their investors. So you only 
have two possible slots: you can schedule a course either Monday- 
Wednesday 11am-1pm or Tuesday-Thursday 11am-1pm. 

Let SCHEDULE : {0,1}* — {0,1} be the function that takes as input 
a list of courses L and a list of conflicts C (i.e., list of pairs of courses 
that cannot share the same time slot) and outputs 1 if and only if there 
is a “conflict free” scheduling of the courses in L, where no pair in C is 
scheduled in the same time slot. 

More precisely, the list $L$ is a list of strings $(c_0, \ldots, c_{n-1})$ and the list
$C$ is a list of pairs of the form $(c_i, c_j)$. $SCHEDULE(L,C) = 1$ if and
only if there exists a partition of $c_0, \ldots, c_{n-1}$ into two parts so that there
is no pair $(c_i, c_j) \in C$ such that both $c_i$ and $c_j$ are in the same part.

Prove that $SCHEDULE \in$ P. As usual, you do not have to provide
the full code to show that this is the case, and can describe operations
at a high level, as well as appeal to any data structures or other results
mentioned in the book or in lecture. Note that to show that a function
$F$ is in P you need to both (1) present an algorithm $A$ that computes
$F$ in polynomial time, and (2) prove that $A$ does indeed run in polynomial
time, and does indeed compute the correct answer.

Try to think whether or not your algorithm extends to the case
where there are three possible time slots.

2 Hint: Use Item 1, the existence of functions requiring
exponentially hard NAND programs, and the fact
that there are only finitely many functions mapping
$\{0,1\}^n$ to $\{0,1\}$.


13.8 BIBLIOGRAPHICAL NOTES 


Because we are interested in the maximum number of steps for inputs 
of a given length, running-time as we defined it is often known as 
worst case complexity. The minimum number of steps (or “best case” 
complexity) to compute a function on length n inputs is typically not 
a meaningful quantity since essentially every natural problem will 
have some trivially easy instances. However, the average case complexity 
(i.e., complexity on a “typical” or “random” input) is an interesting 
concept which we’ll return to when we discuss cryptography. That 
said, worst-case complexity is the most standard and basic of the 
complexity measures, and will be our focus in most of this book. 

Some lower bounds for single-tape Turing machines are given in 
[Maa85]. 

For defining efficiency in the λ calculus, one needs to be careful
about the order of application of the reduction steps, which can matter
for computational efficiency, see for example this paper.

The notation $\mathbf{P}_{/poly}$ is used for historical reasons. It was introduced
by Karp and Lipton, who considered this class as corresponding to
functions that can be computed by polynomial-time Turing machines
that are given for any input length $n$ an advice string of length
polynomial in $n$.

14
Polynomial-time reductions 


Consider some of the problems we have encountered in Chapter 12:

1. The 3SAT problem: deciding whether a given 3CNF formula has a
satisfying assignment.

2. Finding the longest path in a graph.

3. Finding the maximum cut in a graph.

4. Solving quadratic equations over $n$ variables $x_0, \ldots, x_{n-1} \in \mathbb{R}$.

All of these problems have the following properties:

• These are important problems, and people have spent significant
effort on trying to find better algorithms for them.

• Each one of these is a search problem, whereby we search for a
solution that is “good” in some easy to define sense (e.g., a long
path, a satisfying assignment, etc.).

• Each of these problems has a trivial exponential time algorithm that
involves enumerating all possible solutions.

• At the moment, for all these problems the best known algorithm is
not much faster than the trivial one in the worst case.


In this chapter and in Chapter 15 we will see that, despite their 
apparent differences, we can relate the computational complexity of 
these and many other problems. In fact, it turns out that the prob- 
lems above are computationally equivalent, in the sense that solving one 
of them immediately implies solving the others. This phenomenon, 
known as NP completeness, is one of the surprising discoveries of the- 
oretical computer science, and we will see that it has far-reaching 
ramifications. 


Compiled on 12.19.2022 22:58 


Learning Objectives: 


• Introduce the notion of polynomial-time
reductions as a way to relate the complexity of
problems to one another.

• See several examples of such reductions.

• 3SAT as a basic starting point for reductions.


In this chapter we will see that for each one of the problems of find- 
ing a longest path in a graph, solving quadratic equations, and finding 
the maximum cut, if there exists a polynomial-time algorithm for this 
problem then there exists a polynomial-time algorithm for the 3SAT 
problem as well. In other words, we will reduce the task of solving 
3SAT to each one of the above tasks. Another way to interpret these 
results is that if there does not exist a polynomial-time algorithm for 
3SAT then there does not exist a polynomial-time algorithm for these 
other problems as well. In Chapter 15 we will see evidence (though 
no proof!) that all of the above problems do not have polynomial-time 
algorithms and hence are inherently intractable. 


14.1 FORMAL DEFINITIONS OF PROBLEMS 


For reasons of technical convenience rather than anything substantial,
we concern ourselves with decision problems (i.e., Yes/No questions) or
in other words Boolean (i.e., one-bit output) functions. We model the
problems above as functions mapping $\{0,1\}^*$ to $\{0,1\}$ in the following
way:

Figure 14.1: In this chapter we show that if the 3SAT
problem cannot be solved in polynomial time, then
neither can the QUADEQ, LONGESTPATH, ISET
and MAXCUT problems. We do this by using the
reduction paradigm showing for example “if pigs could
whistle” (i.e., if we had an efficient algorithm for
QUADEQ) then “horses could fly” (i.e., we would
have an efficient algorithm for 3SAT).


3SAT. The 3SAT problem can be phrased as the function
$3SAT : \{0,1\}^* \rightarrow \{0,1\}$ that takes as input a 3CNF formula $\varphi$ (i.e., a formula
of the form $C_0 \wedge \cdots \wedge C_{m-1}$ where each $C_i$ is the OR of three variables
or their negation) and maps $\varphi$ to 1 if there exists some assignment to
the variables of $\varphi$ that causes it to evaluate to true, and to 0 otherwise.
For example

$$3SAT\left(\text{“}(x_0 \vee \overline{x}_1 \vee x_2) \wedge (x_1 \vee x_2 \vee \overline{x}_3) \wedge (\overline{x}_0 \vee \overline{x}_2 \vee x_3)\text{”}\right) = 1$$

since the assignment $x = 1101$ satisfies the input formula. In the
above we assume some representation of formulas as strings, and
define the function to output 0 if its input is not a valid representation;
we use the same convention for all the other functions below.
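
To make the definition concrete, here is a minimal Python sketch of a brute-force 3SAT evaluator. The encoding of a literal as an `(index, negated)` pair, the function names `evaluate` and `three_sat`, and the sample formula `phi` are assumptions made for illustration and are not the string representation the text has in mind. The sample formula is chosen so that the assignment 1101 satisfies it.

```python
from itertools import product

# Illustrative encoding (an assumption, not the book's representation):
# a literal is a pair (i, negated), so (1, True) stands for NOT x_1;
# a 3CNF formula is a list of clauses, each a tuple of three literals.

def evaluate(formula, x):
    """Return True iff the 0/1 string x satisfies every clause of formula."""
    return all(
        any((x[i] == "1") != negated for (i, negated) in clause)
        for clause in formula
    )

def three_sat(formula, n):
    """Brute-force 3SAT over n variables: tries all 2^n assignments."""
    return int(any(evaluate(formula, "".join(bits))
                   for bits in product("01", repeat=n)))

# (x0 OR NOT x1 OR x2) AND (x1 OR x2 OR NOT x3) AND (NOT x0 OR NOT x2 OR x3)
phi = [((0, False), (1, True), (2, False)),
       ((1, False), (2, False), (3, True)),
       ((0, True), (2, True), (3, False))]
```

The evaluator runs in time linear in the formula size, while `three_sat` takes exponential time; this is the trivial algorithm that places 3SAT in EXP.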


Quadratic equations. The quadratic equations problem corresponds to the 
function QUADEQ : {0,1}* — {0,1} that maps a set of quadratic 
equations F to 1 if there is an assignment x that satisfies all equations, 
and to 0 otherwise. 


Longest path. The longest path problem corresponds to the function 
LONGPATH : {0,1}* — {0,1} that maps a graph G and a number k 
to 1 if there is a simple path in G of length at least k, and maps (G, k) 
to 0 otherwise. The longest path problem is a generalization of the 
well-known Hamiltonian Path Problem of determining whether a path 
of length n exists in a given n vertex graph. 


Maximum cut. The maximum cut problem corresponds to the function 
MAXCUT : {0,1}* — {0,1} that maps a graph G and a number k to 
1 if there is a cut in G that cuts at least k edges, and maps (G, k) to 0 
otherwise. 

All of the problems above are in EXP but it is not known whether
or not they are in P. However, we will see in this chapter that if any of
QUADEQ, LONGPATH or MAXCUT is in P, then so is 3SAT.


14.2 POLYNOMIAL-TIME REDUCTIONS 


Suppose that $F,G : \{0,1\}^* \rightarrow \{0,1\}$ are two Boolean functions. A
polynomial-time reduction (or sometimes just “reduction” for short) from
$F$ to $G$ is a way to show that $F$ is “no harder” than $G$, in the sense
that a polynomial-time algorithm for $G$ implies a polynomial-time
algorithm for $F$.

Definition 14.1 — Polynomial-time reductions. Let $F,G : \{0,1\}^* \rightarrow \{0,1\}$.
We say that $F$ reduces to $G$, denoted by $F \leq_p G$, if there is a
polynomial-time computable $R : \{0,1\}^* \rightarrow \{0,1\}^*$ such that
for every $x \in \{0,1\}^*$,

$$F(x) = G(R(x)) \;. \tag{14.1}$$

We say that $F$ and $G$ have equivalent complexity if $F \leq_p G$ and $G \leq_p F$.

The following exercise justifies our intuition that $F \leq_p G$ signifies
that “$F$ is no harder than $G$”.


Solved Exercise 14.1 — Reductions and P. Prove that if $F \leq_p G$ and $G \in$ P
then $F \in$ P.

Figure 14.2: If $F \leq_p G$ then we can transform a
polynomial-time algorithm $B$ that computes $G$
into a polynomial-time algorithm $A$ that computes
$F$. To compute $F(x)$ we can run the reduction $R$
guaranteed by the fact that $F \leq_p G$ to obtain $y = R(x)$
and then run our algorithm $B$ for $G$ to compute $G(y)$.

Solution:
Suppose there is an algorithm $B$ that computes $G$ in time $p(n)$
where $n$ is its input size. Then, (14.1) directly gives an algorithm
$A$ to compute $F$ (see Fig. 14.2). Indeed, on input $x \in \{0,1\}^*$,
Algorithm $A$ will run the polynomial-time reduction $R$ to obtain
$y = R(x)$ and then return $B(y)$. By (14.1), $G(R(x)) = F(x)$ and
hence Algorithm $A$ will indeed compute $F$.

We now show that $A$ runs in polynomial time. By assumption, $R$
can be computed in time $q(n)$ for some polynomial $q$. In particular,
this means that $|y| \leq q(|x|)$ (as just writing down $y$ takes $|y|$ steps).
Computing $B(y)$ will take at most $p(|y|) \leq p(q(|x|))$ steps. Thus
the total running time of $A$ on inputs of length $n$ is at most the time
to compute $y$, which is bounded by $q(n)$, and the time to compute
$B(y)$, which is bounded by $p(q(n))$, and since the composition of
two polynomials is a polynomial, $A$ runs in polynomial time.
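
The construction of $A$ from $R$ and $B$ can be sketched in a few lines of Python. This is only an illustration: the helper name `solve_via_reduction` and the toy functions `R` and `B` (a parity problem) are hypothetical and were chosen only so that the identity $F(x) = G(R(x))$ is easy to check by hand.

```python
def solve_via_reduction(R, B):
    """Given a reduction R from F to G and an algorithm B for G,
    return an algorithm A that computes F(x) = G(R(x))."""
    def A(x):
        y = R(x)      # takes time q(|x|), so |y| <= q(|x|)
        return B(y)   # takes time p(|y|) <= p(q(|x|))
    return A

# Toy instance (hypothetical, for illustration only):
#   G(y) = 1 iff y contains an even number of 1s,
#   F(x) = 1 iff x contains an odd number of 1s.
def B(y):
    return int(y.count("1") % 2 == 0)

def R(x):
    # Appending a single 1 flips the parity, so F(x) = G(R(x)).
    return x + "1"

A = solve_via_reduction(R, B)
```

Here `A("1011")` evaluates `B("10111")`, so the running time of `A` is the time of `R` plus the time of `B` on the longer string, exactly as in the analysis above.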


14.2.1 Whistling pigs and flying horses 
A reduction from $F$ to $G$ can be used for two purposes:

• If we already know an algorithm for $G$ and $F \leq_p G$ then we can
use the reduction to obtain an algorithm for $F$. This is a widely
used tool in algorithm design. For example in Section 12.1.4 we saw
how the Min-Cut Max-Flow theorem allows us to reduce the task of
computing a minimum cut in a graph to the task of computing a
maximum flow in it.

• If we have proven (or have evidence) that there exists no
polynomial-time algorithm for $F$ and $F \leq_p G$ then the existence of this reduction
allows us to conclude that there exists no polynomial-time
algorithm for $G$. This is the “if pigs could whistle then horses could
fly” interpretation we've seen in Section 9.4. We show that if there
were a hypothetical efficient algorithm for $G$ (a “whistling pig”)
then, since $F \leq_p G$, there would be an efficient algorithm for
$F$ (a “flying horse”). In this book we often use reductions for this
second purpose, although the line between the two is sometimes
blurry (see the bibliographical notes in Section 14.10).


The most crucial difference between the notion in Definition 14.1
and the reductions we saw in the context of uncomputability (e.g.,
in Section 9.4) is that for relating time complexity of problems, we
need the reduction to be computable in polynomial time, as opposed to
merely computable. Definition 14.1 also restricts reductions to have a
very specific format. That is, to show that $F \leq_p G$, rather than allowing
a general algorithm for $F$ that uses a “magic box” that computes
$G$, we only allow an algorithm that computes $F(x)$ by outputting
$G(R(x))$. This restricted form is convenient for us, but people have
defined and used more general reductions as well (see Section 14.10).

In this chapter we use reductions to relate the computational com- 
plexity of the problems mentioned above: 3SAT, Quadratic Equations, 
Maximum Cut, and Longest Path, as well as a few others. We will 
reduce 3SAT to the latter problems, demonstrating that solving any 
one of them efficiently will result in an efficient algorithm for 3SAT. 
In Chapter 15 we show the other direction: reducing each one of these 
problems to 3SAT in one fell swoop. 


Transitivity of reductions. Since we think of $F \leq_p G$ as saying that
(as far as polynomial-time computation is concerned) $F$ is “easier or
equal in difficulty to” $G$, we would expect that if $F \leq_p G$ and $G \leq_p H$,
then it would hold that $F \leq_p H$. Indeed this is the case:

Solved Exercise 14.2 — Transitivity of polynomial-time reductions. For every
$F,G,H : \{0,1\}^* \rightarrow \{0,1\}$, if $F \leq_p G$ and $G \leq_p H$ then $F \leq_p H$.

Solution:

If $F \leq_p G$ and $G \leq_p H$ then there exist polynomial-time
computable functions $R_1$ and $R_2$ mapping $\{0,1\}^*$ to $\{0,1\}^*$ such that
for every $x \in \{0,1\}^*$, $F(x) = G(R_1(x))$ and for every $y \in \{0,1\}^*$,
$G(y) = H(R_2(y))$. Combining these two equalities, we see that
for every $x \in \{0,1\}^*$, $F(x) = H(R_2(R_1(x)))$ and so to show that
$F \leq_p H$, it is sufficient to show that the map $x \mapsto R_2(R_1(x))$ is
computable in polynomial time. But if there are some constants $c, d$
such that $R_1(x)$ is computable in time $|x|^c$ and $R_2(y)$ is computable
in time $|y|^d$ then $R_2(R_1(x))$ is computable in time $(|x|^c)^d = |x|^{cd}$
which is polynomial.
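
The time bound in the solution can be spelled out a little more carefully. The following worked inequality is a sketch under the added assumptions that $c, d \geq 1$ and $|x| \geq 1$; it also accounts for the time spent computing the intermediate string $y = R_1(x)$, whose length is at most $|x|^c$ because writing it down takes $|y|$ steps.

```latex
\begin{aligned}
\text{time to compute } R_2(R_1(x))
  &\le \underbrace{|x|^c}_{\text{compute } y = R_1(x)}
     + \underbrace{|y|^d}_{\text{compute } R_2(y)} \\
  &\le |x|^c + \left(|x|^c\right)^d
     \qquad (\text{since } |y| \le |x|^c) \\
  &\le 2\,|x|^{cd} .
\end{aligned}
```

Up to the constant factor 2 this matches the bound $|x|^{cd}$ stated above, so the composed map is polynomial-time computable.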


14.3 REDUCING 3SAT TO ZERO ONE AND QUADRATIC EQUATIONS 


We now show our first example of a reduction. The Zero-One Linear
Equations problem corresponds to the function $01EQ : \{0,1\}^* \rightarrow \{0,1\}$
whose input is a collection $E$ of linear equations in variables
$x_0, \ldots, x_{n-1}$, and the output is 1 iff there is an assignment $x \in \{0,1\}^n$
of 0/1 values to the variables that satisfies all the equations. For example,
if the input $E$ is a string encoding the set of equations

$$\begin{aligned} x_0 + x_1 + x_2 &= 2 \\ x_0 + x_2 &= 1 \\ x_1 + x_2 &= 2 \end{aligned}$$

then $01EQ(E) = 1$ since the assignment $x = 011$ satisfies all three
equations. We specifically restrict attention to linear equations in
variables $x_0, \ldots, x_{n-1}$ in which every equation has the form
$\sum_{i \in S} x_i = b$ where $S \subseteq [n]$ and $b \in \mathbb{N}$ (see footnote 1).

If we asked the question of whether there is a solution $x \in \mathbb{R}^n$ of
real numbers to $E$, then this can be solved using the famous Gaussian
elimination algorithm in polynomial time. However, there is no known
efficient algorithm to solve 01EQ. Indeed, such an algorithm would
imply an algorithm for 3SAT as shown by the following theorem:

Theorem 14.2 — Hardness of 01EQ. $3SAT \leq_p 01EQ$


Proof Idea:

A constraint $x_2 \vee \overline{x}_5 \vee x_7$ can be written as $x_2 + (1 - x_5) + x_7 \geq 1$.
This is a linear inequality but since the sum on the left-hand side is
at most three, we can also turn it into an equality by adding two new
variables $y, z$ and writing it as $x_2 + (1 - x_5) + x_7 + y + z = 3$. (We
will use fresh variables $y, z$ for every constraint.) Finally, for every
variable $x_i$ we can add a variable $x'_i$ corresponding to its negation by
adding the equation $x_i + x'_i = 1$, hence mapping the original constraint
$x_2 \vee \overline{x}_5 \vee x_7$ to $x_2 + x'_5 + x_7 + y + z = 3$. The main takeaway
technique from this reduction is the idea of adding auxiliary variables
to replace an equation such as $x_1 + x_2 + x_3 \geq 1$ that is not quite in the
form we want with the equivalent (for 0/1 valued variables) equation
$x_1 + x_2 + x_3 + u + v = 3$ which is in the form we want.

*

1 If you are familiar with matrix notation you may
note that such equations can be written as $Ax = b$
where $A$ is an $m \times n$ matrix with entries in 0/1 and
$b \in \mathbb{N}^m$.


Proof of Theorem 14.2. To prove the theorem we need to:

1. Describe an algorithm $R$ for mapping an input $\varphi$ for 3SAT into an
input $E$ for 01EQ.

2. Prove that the algorithm runs in polynomial time.

3. Prove that $01EQ(R(\varphi)) = 3SAT(\varphi)$ for every 3CNF formula $\varphi$.

We now proceed to do just that. Since this is our first reduction, we
will spell out this proof in detail. However, it straightforwardly follows
the proof idea.

Figure 14.3: Left: Python code implementing the
reduction of 3SAT to 01EQ. Right: Example output of
the reduction. Code is in our repository.

The reduction is described in Algorithm 14.3, see also Fig. 14.3. If
the input formula has $n$ variables and $m$ clauses, Algorithm 14.3 creates
a set $E$ of $n + m$ equations over $2n + 2m$ variables. Algorithm 14.3
makes an initial loop of $n$ steps (each taking constant time) and then
another loop of $m$ steps (each taking constant time) to create the
equations, and hence it runs in polynomial time.
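
The two loops just described can be sketched in Python as follows. This is not the code from the book's repository: the names `sat_to_01eq` and `zero_one_eq`, the `(index, negated)` literal encoding, and the tuple names `("x", i)`, `("nx", i)`, `("y", j)`, `("z", j)` for variables are all assumptions made here for illustration. The brute-force solver is included only so the reduction can be checked on tiny inputs.

```python
from itertools import product

def sat_to_01eq(formula, n):
    """Map a 3CNF formula over x_0..x_{n-1} to a list of 0/1 linear equations.

    A literal is a pair (i, negated). Each output equation is a pair
    (variables, b), meaning that the listed variables must sum to b.
    """
    eqs = []
    for i in range(n):                      # first loop: x_i + x'_i = 1
        eqs.append(([("x", i), ("nx", i)], 1))
    for j, clause in enumerate(formula):    # second loop: t_0 + t_1 + t_2 + y_j + z_j = 3
        lits = [("nx", i) if neg else ("x", i) for (i, neg) in clause]
        eqs.append((lits + [("y", j), ("z", j)], 3))
    return eqs

def zero_one_eq(eqs):
    """Brute-force 01EQ (exponential time), for testing tiny instances only."""
    names = sorted({v for vs, _ in eqs for v in vs})
    for bits in product((0, 1), repeat=len(names)):
        val = dict(zip(names, bits))
        if all(sum(val[v] for v in vs) == b for vs, b in eqs):
            return 1
    return 0
```

For a formula with $n$ variables and $m$ clauses the output has $n + m$ equations over $2n + 2m$ variables, matching the count in the text.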

Let $R$ be the function computed by Algorithm 14.3. The heart of
the proof is to show that for every 3CNF $\varphi$, $01EQ(R(\varphi)) = 3SAT(\varphi)$.
We split the proof into two parts. The first part, traditionally known
as the completeness property, is to show that if $3SAT(\varphi) = 1$ then
$01EQ(R(\varphi)) = 1$. The second part, traditionally known as the
soundness property, is to show that if $01EQ(R(\varphi)) = 1$ then $3SAT(\varphi) = 1$.
(The names “completeness” and “soundness” derive from viewing a
solution to $R(\varphi)$ as a “proof” that $\varphi$ is satisfiable, in which case these
conditions correspond to completeness and soundness as defined
in Section 11.1.1. However, if you find the names confusing you can
simply think of completeness as the “1-instance maps to 1-instance”
property and soundness as the “0-instance maps to 0-instance”
property.)

We complete the proof by showing both parts:

• Completeness: Suppose that $3SAT(\varphi) = 1$, which means that
there is an assignment $x \in \{0,1\}^n$ that satisfies $\varphi$. If we use the
assignment $x_0, \ldots, x_{n-1}$ and $1 - x_0, \ldots, 1 - x_{n-1}$ for the first $2n$
variables of $E = R(\varphi)$ then we will satisfy all equations of the form
$x_i + x'_i = 1$. Moreover, for every $j \in [m]$, if $t_0 + t_1 + t_2 + y_j + z_j = 3$ $(*)$
is the equation arising from the $j$-th clause of $\varphi$ (with $t_0, t_1, t_2$
being variables of the form $x_i$ or $x'_i$ depending on the literals of the
clause) then our assignment to the first $2n$ variables ensures that
$t_0 + t_1 + t_2 \geq 1$ (since $x$ satisfies $\varphi$) and hence we can assign values
to $y_j$ and $z_j$ that will ensure that the equation $(*)$ is satisfied. Hence
in this case $E = R(\varphi)$ is satisfied, meaning that $01EQ(R(\varphi)) = 1$.

• Soundness: Suppose that $01EQ(R(\varphi)) = 1$, which means that the
set of equations $E = R(\varphi)$ has a satisfying assignment $x_0, \ldots, x_{n-1}$,
$x'_0, \ldots, x'_{n-1}$, $y_0, \ldots, y_{m-1}$, $z_0, \ldots, z_{m-1}$. Then, since the equations
contain the condition $x_i + x'_i = 1$, for every $i \in [n]$, $x'_i$ is the negation
of $x_i$, and moreover, for every $j \in [m]$, if $C$ has the form $w_0 \vee w_1 \vee w_2$
and is the $j$-th clause of $\varphi$, then the corresponding assignment $x$
will ensure that $w_0 + w_1 + w_2 \geq 1$ (since $y_j + z_j \leq 2$ while the
left-hand side of the $j$-th equation sums to 3), implying that $C$ is satisfied.
Hence in this case $3SAT(\varphi) = 1$.


14.3.1 Quadratic equations 
Now that we reduced 3SAT to 01EQ, we can use this to reduce 3SAT
to the quadratic equations problem. This is the function QUADEQ in
which the input is a list of $n$-variate polynomials $p_0, \ldots, p_{m-1} : \mathbb{R}^n \rightarrow \mathbb{R}$
that are all of degree at most two (i.e., they are quadratic) and with
integer coefficients. (The latter condition is for convenience and can
be achieved by scaling.) We define $QUADEQ(p_0, \ldots, p_{m-1})$ to equal 1
if and only if there is a solution $x \in \mathbb{R}^n$ to the equations $p_0(x) = 0$,
$p_1(x) = 0$, ..., $p_{m-1}(x) = 0$.

For example, the following is a set of quadratic equations over the
variables $x_0, x_1, x_2$:

$$\begin{aligned} x_0^2 - x_0 &= 0 \\ x_1^2 - x_1 &= 0 \\ x_2^2 - x_2 &= 0 \\ 1 - x_0 - x_1 + x_0 x_1 &= 0 \end{aligned}$$

You can verify that $x \in \mathbb{R}^3$ satisfies this set of equations if and only if
$x \in \{0,1\}^3$ and $x_0 \vee x_1 = 1$ (the last equation factors as
$(1 - x_0)(1 - x_1) = 0$).

Theorem 14.4 — Hardness of quadratic equations.

$3SAT \leq_p QUADEQ$

Proof Idea:

Using the transitivity of reductions (Solved Exercise 14.2), it is
enough to show that $01EQ \leq_p QUADEQ$, but this follows since we can
phrase the constraint $x_i \in \{0,1\}$ as the quadratic constraint $x_i^2 - x_i = 0$.
The takeaway technique of this reduction is that we can use
non-linearity to force continuous variables (e.g., variables taking values in
$\mathbb{R}$) to be discrete (e.g., take values in $\{0,1\}$).


* 


Proof of Theorem 14.4. By Theorem 14.2 and Solved Exercise 14.2,
it is sufficient to prove that $01EQ \leq_p QUADEQ$. Let $E$ be an
instance of 01EQ with variables $x_0, \ldots, x_{n-1}$. We map $E$ to the set of
quadratic equations $E'$ that is obtained by taking the linear equations
in $E$ and adding to them the $n$ quadratic equations $x_i^2 - x_i = 0$ for
all $i \in [n]$. (See Algorithm 14.5.) The map $E \mapsto E'$ can be computed
in polynomial time. We claim that $01EQ(E) = 1$ if and only
if $QUADEQ(E') = 1$. Indeed, the only difference between the two
instances is that:

• In the 01EQ instance $E$, the equations are over variables $x_0, \ldots, x_{n-1}$
in $\{0,1\}$.

• In the QUADEQ instance $E'$, the equations are over variables
$x_0, \ldots, x_{n-1} \in \mathbb{R}$ but we have the extra constraints $x_i^2 - x_i = 0$
for all $i \in [n]$.

Since for every $a \in \mathbb{R}$, $a^2 - a = 0$ if and only if $a \in \{0,1\}$, the two
sets of equations are equivalent and $01EQ(E) = QUADEQ(E')$ which
is what we wanted to prove.
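
The map $E \mapsto E'$ is simple enough to sketch directly. In the Python below, a polynomial is stored as a dictionary from monomials (tuples of variable indices, with `()` for the constant term) to integer coefficients; this representation and the names `eq01_to_quadeq` and `evaluate_poly` are assumptions made for illustration and are not taken from Algorithm 14.5.

```python
def eq01_to_quadeq(eqs, n):
    """Map 01EQ equations [(S, b)] over x_0..x_{n-1} to quadratic polynomials.

    Each polynomial p is a dict {monomial: coefficient}; the instance asks
    for a real point x at which every p evaluates to 0.
    """
    polys = []
    for S, b in eqs:
        p = {(i,): 1 for i in S}   # linear part: sum of x_i for i in S
        p[()] = -b                 # constant term, so p(x) = sum - b
        polys.append(p)
    for i in range(n):
        polys.append({(i, i): 1, (i,): -1})   # x_i^2 - x_i = 0 forces x_i in {0, 1}
    return polys

def evaluate_poly(p, x):
    """Evaluate polynomial p at the point x (a sequence of numbers)."""
    total = 0
    for monomial, coefficient in p.items():
        term = coefficient
        for i in monomial:
            term *= x[i]
        total += term
    return total
```

The output has $n$ more equations than the input and is produced in one pass, which is why the map is polynomial-time computable.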


14.4 THE SUBSET SUM PROBLEM 


As another consequence of the reduction of 3SAT to 01EQ, we can
also show that 3SAT (through 01EQ) reduces to the subset sum problem
(also known as the knapsack problem). In the subset sum problem,
we are given a list of integers $x_0, \ldots, x_{n-1} \in \mathbb{Z}$ and an integer
$T \in \mathbb{Z}$. We need to determine whether or not there exists some subset
of the integers that sums up to $T$. That is, for $x_0, \ldots, x_{n-1}, T \in \mathbb{Z}$,
$SSUM(x_0, \ldots, x_{n-1}, T) = 1$ if and only if there exists $S \subseteq [n]$ such that
$\sum_{i \in S} x_i = T$. Note that the input length for the subset sum problem
is the length of the string needed to encode all the numbers, which will
be approximately $\lceil \log T \rceil + \sum_{i=0}^{n-1} \lceil \log x_i \rceil$, since encoding an integer $x$
using the binary representation requires $\lceil \log x \rceil$ bits.

Theorem 14.6 — Hardness of subset sum.

$3SAT \leq_p SSUM$


Proof Idea:

We reduce from 01EQ. The intuition is the following. Consider
an instance $E$ of 01EQ with $n$ variables $x_0, \ldots, x_{n-1}$ and $m$ equations
$e_0, \ldots, e_{m-1}$. Recall that each equation $e_t$ in $E$ has the form
$x_i + x_j + x_k = b$ (potentially with more or less than three variables
summed up on the left-hand side of the equation). For every variable
$x_i$ we can define a vector $v^i \in \{0,1\}^m$ where $v^i_t = 1$ if the variable
$x_i$ appears in the equation $e_t$ and $v^i_t = 0$ otherwise. Then there is a
solution to the set of equations if and only if there is some set $S \subseteq [n]$
(corresponding to the $i$'s such that $x_i = 1$) such that $\sum_{i \in S} v^i = b$
where $b \in \mathbb{Z}^m$ is the vector of right-hand sides of the equations (i.e., $b_t$
is the value on the right-hand side of the $t$-th equation). Now if we
could interpret the vectors $v^0, \ldots, v^{n-1}$ and $b$ as numbers then we could
think of this as a subset sum instance. The key insight is that we can in
fact think of vectors as numbers by thinking of the $j$-th coordinate of
the vector $v$ as the $j$-th digit. Since the vectors are in $\{0,1\}^m$, the natural
choice is to use the binary basis, but this turns out to cause issues
with “carries” when we add them up. Hence we use a larger basis $B$,
see proof below.

* 


Proof of Theorem 14.6. For a given instance of 01EQ on $n$ variables, we note
that in a satisfiable instance no right-hand side can be larger than $n$ (since
the sum of at most $n$ variables in $\{0,1\}$ is at most $n$). Hence if the instance
has an equation whose right-hand side is larger than $n$, then we know for
sure that the answer is 0 (and in the context of a reduction we map it into
some trivial instance of subset sum that doesn't have a solution, such as
$x_0 = x_1 = 1$ and $T = 3$). We assume from now on that every right-hand side
is at most $n$.

Our reduction is described in Algorithm 14.7. On input an instance
$E = \{e_t\}_{t=0}^{m-1}$ of 01EQ over $n$ variables $x_0, \ldots, x_{n-1}$, we output an SSUM
instance $y_0, \ldots, y_{n-1}, T$ computed as follows:

• $y_i = \sum_{t=0}^{m-1} B^t v^i_t$ where $v^i_t$ equals 1 if the variable $x_i$ appears in the
equation $e_t$ and equals 0 otherwise. The number $B$ is set to be $2n$
(any number larger than $n$ would work).

• $T = \sum_{t=0}^{m-1} B^t b_t$ where $b_t$ is the integer on the right-hand side of the
equation $e_t$.

In other words, $y_0, \ldots, y_{n-1}$ and $T$ are the integers such that, written
in the $B$-ary basis, the $t$-th digit of $y_i$ is 1 iff $x_i$ appears in $e_t$, and the
$t$-th digit of $T$ is the right-hand side of $e_t$.

The following claim will imply the correctness of the reduction:

Claim: For every $x \in \{0,1\}^n$, if $S = \{ i \mid x_i = 1 \}$ then $x$ satisfies the
equations of $E$ if and only if $\sum_{i \in S} y_i = T$.

Proof: Key to the proof is the following simple property of grade-school
addition: when adding at most $n$ numbers in the $B$-ary basis,
if all the numbers have all their digits either 0 or 1, and $B > n$, then
for every $t$, the $t$-th digit of the sum is the sum of the $t$-th digits of
the numbers. This is a simple consequence of the fact that there is no
“carry” in the addition. Since in our case the numbers $y_0, \ldots, y_{n-1}$
satisfy this property in the $B$-ary basis, and $B > n$, we get that for every
$S \subseteq [n]$ and every digit $t$, the $t$-th digit of the sum $\sum_{i \in S} y_i$ is simply
the sum of the $t$-th digits, which corresponds to the sum of $x_i$ over
all the $x_i$'s that participate in the $t$-th equation. This sum equals
the $t$-th digit of $T$ if and only if that equation is satisfied.

The claim shows that $01EQ(E) = SSUM(y_0, \ldots, y_{n-1}, T)$ which is
what we needed to prove.
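
The encoding of equations as base-$B$ digits can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of Algorithm 14.7: the names `eq01_to_ssum` and `ssum` are hypothetical, each equation is given as a pair `(S, b)` with `S` a set of variable indices, and the brute-force subset sum solver is there only to check small cases.

```python
from itertools import combinations

def eq01_to_ssum(eqs, n):
    """Map 01EQ equations [(S_t, b_t)] over x_0..x_{n-1} to an SSUM instance (ys, T)."""
    if any(b > n for _, b in eqs):
        return [1, 1], 3              # trivial instance with no solution
    B = 2 * n                         # any base larger than n avoids carries
    # The t-th base-B digit of y_i is 1 iff x_i appears in equation e_t.
    ys = [sum(B**t for t, (S, _) in enumerate(eqs) if i in S) for i in range(n)]
    # The t-th base-B digit of T is the right-hand side b_t.
    T = sum(B**t * b for t, (_, b) in enumerate(eqs))
    return ys, T

def ssum(ys, T):
    """Brute-force subset sum (exponential time), for tiny instances only."""
    return int(any(sum(c) == T
                   for r in range(len(ys) + 1)
                   for c in combinations(ys, r)))
```

For the three-equation example of Section 14.3 (with $n = 3$ and $B = 6$) this yields $y = (7, 37, 43)$ and $T = 80$, and indeed $y_1 + y_2 = 80$, corresponding to the assignment $x = 011$.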


14.5 THE INDEPENDENT SET PROBLEM 


For a graph $G = (V, E)$, an independent set (also known as a stable
set) is a subset $S \subseteq V$ such that there are no edges with both endpoints
in $S$ (in other words, $E(S, S) = \emptyset$). Every “singleton” (set
consisting of a single vertex) is trivially an independent set, but finding
larger independent sets can be challenging. The maximum independent
set problem (henceforth simply “independent set”) is the task of
finding the largest independent set in the graph. The independent set
problem is naturally related to scheduling problems: if we put an edge
between two conflicting tasks, then an independent set corresponds
to a set of tasks that can all be scheduled together without conflicts.
The independent set problem has been studied in a variety of settings,
including for example in the case of algorithms for finding structure in
protein-protein interaction graphs.

As mentioned in Section 14.1, we think of the independent set problem
as the function $ISET : \{0,1\}^* \rightarrow \{0,1\}$ that on input a graph $G$
and a number $k$ outputs 1 if and only if the graph $G$ contains an
independent set of size at least $k$. We now reduce 3SAT to Independent
set.

Theorem 14.8 — Hardness of Independent Set. $3SAT \leq_p ISET$.


Proof Idea: 

The idea is that finding a satisfying assignment to a 3SAT formula
corresponds to satisfying many local constraints without creating
any conflicts. One can think of “$x_{17} = 0$” and “$x_{17} = 1$” as two
conflicting events, and of the constraint $x_{17} \vee \overline{x}_5 \vee x_9$ as creating
a conflict between the events “$x_{17} = 0$”, “$x_5 = 1$” and “$x_9 = 0$”,
saying that these three cannot simultaneously co-occur. Using these
ideas, we can think of solving a 3SAT problem as trying to
schedule non-conflicting events, though the devil is, as usual, in the
details. The takeaway technique here is to map each clause of the
original formula into a gadget which is a small subgraph (or more
generally “subinstance”) satisfying some convenient properties. We
will see these “gadgets” used time and again in the construction of
polynomial-time reductions.

* 


Figure 14.4: An example of the reduction of 3SAT 

to ISET for the case the original input formula is 

p= (£o VTi Va) A (To Vti VB) A (x1 V T2 VT3). We 
map each clause of $\varphi$ to a triangle of three vertices,
each tagged above with “$x_i = 0$” or “$x_i = 1$”
depending on the value of $x_i$ that would satisfy the
particular literal. We put an edge between every two
literals that are conflicting (i.e., tagged with “$x_i = 0$”
and “$x_i = 1$” respectively).

Proof of Theorem 14.8. Given a 3SAT formula $\varphi$ on $n$ variables and
with $m$ clauses, we will create a graph $G$ with $3m$ vertices as follows.
(See Algorithm 14.9, see also Fig. 14.4 for an example and Fig. 14.5 for
Python code.)

• A clause $C$ in $\varphi$ has the form $C = y \vee y' \vee y''$ where $y, y', y''$ are
literals (variables or their negation). For each such clause $C$, we will
add three vertices to $G$, and label them $(C, y)$, $(C, y')$, and $(C, y'')$
respectively. We will also add the three edges between all pairs of
these vertices, so they form a triangle. Since there are $m$ clauses in $\varphi$,
the graph $G$ will have $3m$ vertices.

• In addition to the above edges, we also add an edge between every
pair of vertices of the form $(C, y)$ and $(C', y')$ where $y$ and $y'$
are conflicting literals. That is, we add an edge between $(C, y)$ and
$(C', y')$ if there is an $i$ such that $y = x_i$ and $y' = \overline{x}_i$ or vice versa.

The algorithm constructing $G$ based on $\varphi$ takes polynomial time
since it involves two loops, the first taking $O(m)$ steps and the second
taking $O(m^2 n)$ steps (see Algorithm 14.9). Hence to prove the theorem
we need to show that $\varphi$ is satisfiable if and only if $G$ contains an
independent set of $m$ vertices. We now show both directions of this
equivalence:

Part 1: Completeness. The “completeness” direction is to show that
if φ has a satisfying assignment x*, then G has an independent set S*
of m vertices. Let us now show this.

Indeed, suppose that φ has a satisfying assignment x* ∈ {0,1}^n.
Then for every clause C = y ∨ y′ ∨ y″ of φ, one of the literals y, y′, y″
must evaluate to true under the assignment x* (as otherwise it would
not satisfy φ). We let S be a set of m vertices that is obtained by
choosing for every clause C one vertex of the form (C, y) such that y
evaluates to true under x*. (If there is more than one such vertex for the
same C, we arbitrarily choose one of them.)

We claim that S is an independent set. Indeed, suppose otherwise
that there was a pair of vertices (C, y) and (C′, y′) in S that have an
edge between them. Since we picked one vertex out of each triangle
corresponding to a clause, it must be that C ≠ C′. Hence the only
way that there is an edge between (C, y) and (C′, y′) is if y and y′ are
conflicting literals (i.e., y = x_i and y′ = ¬x_i for some i). But then they
can’t both evaluate to true under the assignment x*, which contradicts
the way we constructed the set S. This completes the proof of the
completeness condition.

Part 2: Soundness. The “soundness” direction is to show that if
G has an independent set S* of m vertices, then φ has a satisfying
assignment x* ∈ {0,1}^n. Let us now show this.

Indeed, suppose that G has an independent set S* with m vertices.
We will define an assignment x* ∈ {0,1}^n for the variables of φ as
follows. For every i ∈ [n], we set x*_i according to the following rules:

• If S* contains a vertex of the form (C, x_i) then we set x*_i = 1.

• If S* contains a vertex of the form (C, ¬x_i) then we set x*_i = 0.

• If S* does not contain a vertex of either of these forms, then it does
not matter which value we give to x*_i, but for concreteness we'll set
x*_i = 0.

The first observation is that x* is indeed well defined, in the sense
that the rules above do not conflict with one another, and never ask to
set x*_i to be both 0 and 1. This follows from the fact that S* is an
independent set and hence if it contains a vertex of the form (C, x_i)
then it cannot contain a vertex of the form (C′, ¬x_i).

We now claim that x* is a satisfying assignment for φ. Indeed, since
S* is an independent set, it cannot have more than one vertex inside
each one of the m triangles (C, y), (C, y′), (C, y″) corresponding to a
clause of φ. Hence since |S*| = m, it must have exactly one vertex in
each such triangle. For every clause C of φ, if (C, y) is the vertex in
S* in the triangle corresponding to C, then by the way we defined x*,
the literal y must evaluate to true, which means that x* satisfies this
clause. Therefore x* satisfies all clauses of φ, which is the definition of
a satisfying assignment.

This completes the proof of Theorem 14.8.
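The construction in the proof can be sketched in a few lines of Python. This is a minimal sketch and not the code of Fig. 14.5: the clause representation (a list of triples of (variable index, negated) pairs) and the names sat_to_iset, has_independent_set and is_satisfiable are assumptions made for illustration. The two brute-force checkers take exponential time and are only there to test the equivalence on tiny formulas.

```python
from itertools import combinations, product

def sat_to_iset(clauses):
    """Map a 3CNF formula to (vertices, edges, m) following Theorem 14.8.

    A literal is a pair (i, negated); a clause is a tuple of three literals.
    Vertex (c, p) stands for the literal at position p of clause number c.
    """
    vertices = [(c, p) for c in range(len(clauses)) for p in range(3)]
    edges = set()
    for (c, p), (d, q) in combinations(vertices, 2):
        (i, neg_i), (j, neg_j) = clauses[c][p], clauses[d][q]
        triangle = (c == d)                      # both vertices in one clause
        conflict = (i == j and neg_i != neg_j)   # x_i against its negation
        if triangle or conflict:
            edges.add(frozenset([(c, p), (d, q)]))
    return vertices, edges, len(clauses)

def has_independent_set(vertices, edges, k):
    """Brute force: is there a set of k vertices with no edge inside it?"""
    return any(
        all(frozenset(pair) not in edges for pair in combinations(S, 2))
        for S in combinations(vertices, k)
    )

def is_satisfiable(clauses, n):
    """Brute force over all 2^n assignments to x_0, ..., x_{n-1}."""
    return any(
        all(any(x[i] != neg for (i, neg) in clause) for clause in clauses)
        for x in product([False, True], repeat=n)
    )
```

On any small formula, has_independent_set(*sat_to_iset(clauses)) should agree with is_satisfiable(clauses, n), which is exactly the "if and only if" established by the completeness and soundness parts.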


14.6 SOME EXERCISES AND ANATOMY OF A REDUCTION


Reductions can be confusing and working out exercises is a great way 
to gain more comfort with them. Here is one such example. As usual, 
I recommend you try it out yourself before looking at the solution. 


Figure 14.5: The reduction of 3SAT to Independent Set.
On the right-hand side is Python code that implements
this reduction. On the left-hand side is a sample
output of the reduction. We use black for the “triangle
edges” and red for the “conflict edges”. Note that the
satisfying assignment x* = 0110 corresponds to the
independent set (0, ¬x_3), (1, ¬x_0), (2, x_2).


Solved Exercise 14.3 — Vertex cover. A vertex cover in a graph G = (V, E)
is a subset S ⊆ V of vertices such that every edge touches at least
one vertex of S (see Fig. 14.6). The vertex cover problem is the task to
determine, given a graph G and a number k, whether there exists a
vertex cover in the graph with at most k vertices. Formally, this is the
function VC : {0,1}* → {0,1} such that for every G = (V, E) and
k ∈ ℕ, VC(G,k) = 1 if and only if there exists a vertex cover S ⊆ V
such that |S| ≤ k.

Prove that 3SAT ≤_p VC.


Solution: 

The key observation is that if S ⊆ V is a vertex cover (i.e., it
touches all edges), then there is no edge e such that both of e’s
endpoints are in the complement set S̄ = V \ S, and vice versa. In
other words, S is a vertex cover if and only if S̄ is an independent set.
Since the size of S̄ is |V| − |S|, we see that the polynomial-time map
R(G,k) = (G, n − k) (where n is the number of vertices of G)
satisfies that VC(R(G,k)) = ISET(G,k), which means that it is a
reduction from independent set to vertex cover. Since 3SAT ≤_p ISET
by Theorem 14.8, transitivity yields 3SAT ≤_p VC.
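The key observation can be checked mechanically on small graphs. The following is a minimal sketch under assumed conventions (vertices are 0, ..., n−1, edges are pairs, and the function names are hypothetical); complement_check runs over all 2^n subsets, so it is only meant for tiny inputs.

```python
from itertools import combinations

def iset_to_vc(n, edges, k):
    """The map R(G, k) = (G, n - k): the graph is unchanged, only k moves."""
    return n, edges, n - k

def is_vertex_cover(S, edges):
    """Every edge has at least one endpoint in S."""
    return all(u in S or v in S for (u, v) in edges)

def is_independent(S, edges):
    """No edge has both endpoints in S."""
    return not any(u in S and v in S for (u, v) in edges)

def complement_check(n, edges):
    """Check for every subset S that S is a vertex cover
    exactly when the complement of S is an independent set."""
    V = set(range(n))
    return all(
        is_vertex_cover(set(S), edges) == is_independent(V - set(S), edges)
        for r in range(n + 1)
        for S in combinations(range(n), r)
    )
```

For instance, complement_check can be run on a 5-cycle given as [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)].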


Solved Exercise 14.4 — Clique is equivalent to independent set. The maximum
clique problem corresponds to the function CLIQUE : {0,1}* → {0,1}
such that for a graph G and a number k, CLIQUE(G,k) = 1 iff there
is a subset S of k vertices such that for every distinct u,v ∈ S, the edge
{u,v} is in G. Such a set is known as a clique.

Prove that CLIQUE ≤_p ISET and ISET ≤_p CLIQUE.


Solution: 

If G = (V,E) is a graph, we denote by Ḡ its complement, which
is the graph on the same vertices V and such that for every distinct
u,v ∈ V, the edge {u,v} is present in Ḡ if and only if this edge is
not present in G.

This means that for every set S, S is an independent set in G if
and only if S is a clique in Ḡ. Therefore for every k, ISET(G,k) =
CLIQUE(Ḡ,k). Since the map G ↦ Ḡ can be computed efficiently,
this yields a reduction ISET ≤_p CLIQUE. Moreover, since the
complement of Ḡ is G itself, this yields a reduction in the other
direction as well.
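The complement map is short enough to write out. This is a minimal sketch assuming vertices 0, ..., n−1 and edges stored as pairs; the function names are hypothetical.

```python
from itertools import combinations

def complement(n, edges):
    """Return the edge set of the complement of an n-vertex graph."""
    present = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(range(n), 2)} - present

def is_clique(S, edge_set):
    """Every pair of distinct vertices of S is an edge."""
    return all(frozenset(p) in edge_set for p in combinations(S, 2))

def is_independent(S, edge_set):
    """No pair of distinct vertices of S is an edge."""
    return not any(frozenset(p) in edge_set for p in combinations(S, 2))
```

Applying complement twice returns the original edge set, which is why the same map serves as a reduction in both directions.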


14.6.1 Dominating set 
In the two examples above, the reduction was almost “trivial”: the
reduction from independent set to vertex cover merely changes the
number k to n − k, and the reduction from independent set to clique
flips edges to non-edges and vice versa. The following exercise
requires a somewhat more interesting reduction.

Figure 14.6: A vertex cover in a graph is a subset of
vertices that touches all edges. In this 7-vertex graph,
the 3 filled vertices are a vertex cover.


Solved Exercise 14.5 — Dominating set. A dominating set in a graph G =
(V, E) is a subset S ⊆ V of vertices such that every u ∈ V \ S is a
neighbor in G of some s ∈ S (see Fig. 14.7). The dominating set problem
is the task, given a graph G = (V, E) and number k, of determining
whether there exists a dominating set S ⊆ V with |S| ≤ k. Formally,
this is the function DS : {0,1}* → {0,1} such that DS(G,k) = 1 iff
there is a dominating set in G of at most k vertices.

Prove that ISET ≤_p DS.


Solution: 

Since we know that ISET ≤_p VC, using transitivity, it is enough
to show that VC ≤_p DS. As Fig. 14.7 shows, a dominating set is
not the same thing as a vertex cover. However, we can still relate
the two problems. The idea is to map a graph G into a graph H
such that a vertex cover in G would translate into a dominating set
in H and vice versa. We do so by including in H all the vertices
and edges of G, but for every edge {u,v} of G we also add to H a
new vertex w_{u,v} and connect it to both u and v. Let ℓ be the number
of isolated vertices in G. The idea behind the proof is that we can
transform a vertex cover S of k vertices in G into a dominating set
of k + ℓ vertices in H by adding to S all the isolated vertices, and
moreover we can transform every (k + ℓ)-sized dominating set in H
into a vertex cover in G. We now give the details.

Description of the algorithm. Given an instance (G, k) for the
vertex cover problem, we will map G into an instance (H, k′) for
the dominating set problem as follows (see Fig. 14.8 for Python
implementation): H consists of all the vertices and edges of G
together with a new vertex w_{u,v}, adjacent to both u and v, for every
edge {u,v} of G, and k′ = k + ℓ where ℓ is the number of isolated
vertices of G (Algorithm 14.10).


Figure 14.7: A dominating set is a subset S of vertices
such that every vertex in the graph is either in S or a
neighbor of S. The figure above shows two copies of the
same graph. The red vertices on the left are a vertex
cover that is not a dominating set. The blue vertices
on the right are a dominating set that is not a vertex
cover.


Algorithm 14.10 runs in polynomial time, since the loop takes
O(m) steps where m is the number of edges, and each step can be
implemented in constant or at most linear time (depending on the
representation of the graph H). Counting the number of isolated
vertices in an n vertex graph G can be done in time O(n²) if G is
represented in the adjacency matrix representation and O(n) time
if it is represented in the adjacency list representation. Regardless,
the algorithm runs in polynomial time.

To complete the proof we need to prove that for every G, k,
if H, k′ is the output of Algorithm 14.10 on input (G, k), then
DS(H,k′) = VC(G,k). We split the proof into two parts. The
completeness part is that if VC(G,k) = 1 then DS(H,k′) = 1. The
soundness part is that if DS(H,k′) = 1 then VC(G,k) = 1.

Completeness. Suppose that VC(G,k) = 1. Then there is a vertex
cover S ⊆ V of at most k vertices. Let I be the set of isolated
vertices in G and ℓ be their number. Then |S ∪ I| ≤ k + ℓ. We claim
that S ∪ I is a dominating set in H. Indeed for every vertex v of H
there are three cases:

• Case 1: v is an isolated vertex of G. In this case v is in S ∪ I.

• Case 2: v is a non-isolated vertex of G and hence there is an edge
{u,v} of G for some u. In this case since S is a vertex cover, one
of u,v has to be in S, and hence either v or a neighbor of v has to
be in S ⊆ S ∪ I.

• Case 3: v is of the form w_{u,u′} for some two neighbors u, u′ in G.
But then since S is a vertex cover, one of u, u′ has to be in S and
hence S contains a neighbor of v.

We conclude that S ∪ I is a dominating set of size at most
k′ = k + ℓ in H and hence under the assumption that VC(G,k) = 1,
DS(H,k′) = 1.

Soundness. Suppose that DS(H,k′) = 1. Then there is a dominating
set D of size at most k′ = k + ℓ in H. For every edge {u,v} in
the graph G, if D contains the vertex w_{u,v} then we remove this
vertex and add u in its place. The only two neighbors of w_{u,v} are u and
v, and since u is a neighbor of both w_{u,v} and of v, replacing w_{u,v}
with u maintains the property that it is a dominating set. Moreover,
this change cannot increase the size of D. Thus following this
modification, we can assume that D is a dominating set of at most
k + ℓ vertices that does not contain any vertices of the form w_{u,v}.

Let I be the set of isolated vertices in G. These vertices are also
isolated in H and hence must be included in D (an isolated vertex
must be in any dominating set, since it has no neighbors). We
let S = D \ I. Then |S| ≤ k. We claim that S is a vertex cover
in G. Indeed, for every edge {u,v} of G, either the vertex w_{u,v} or
one of its neighbors must be in D by the dominating set property.
But since we ensured D doesn’t contain any of the vertices of the
form w_{u,v}, it must be the case that either u or v is in D, and since
neither u nor v is isolated, it is also in S. This shows
that S is a vertex cover of G of size at most k, hence proving that
VC(G,k) = 1.
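The reduction and its analysis can be tested on small graphs with a short Python sketch. This is not the code of Fig. 14.8 (which is not reproduced here); the representation (vertices 0, ..., n−1, edges as pairs, new vertices named ("w", u, v)) and all function names are assumptions for illustration, and both checkers are exponential-time brute force.

```python
from itertools import combinations

def vc_to_ds(n, edges, k):
    """Map a vertex cover instance (G, k) to a dominating set instance (H, k').

    H keeps all vertices and edges of G and gets one new vertex
    ("w", u, v) per edge {u, v}, joined to both u and v.
    """
    h_vertices = list(range(n))
    h_edges = [tuple(e) for e in edges]
    touched = set()
    for (u, v) in edges:
        w = ("w", u, v)
        h_vertices.append(w)
        h_edges += [(w, u), (w, v)]
        touched.update((u, v))
    isolated = n - len(touched)   # the number of isolated vertices of G
    return h_vertices, h_edges, k + isolated

def has_vertex_cover(n, edges, k):
    """Brute force: is there a vertex cover with at most k vertices?"""
    return any(all(u in S or v in S for (u, v) in edges)
               for r in range(min(k, n) + 1)
               for S in map(set, combinations(range(n), r)))

def has_dominating_set(vertices, edges, k):
    """Brute force: is there a dominating set with at most k vertices?"""
    nbrs = {v: {v} for v in vertices}   # closed neighborhoods
    for (a, b) in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    return any(set().union(*(nbrs[s] for s in S)) == set(vertices)
               for r in range(min(k, len(vertices)) + 1)
               for S in combinations(vertices, r))
```

Completeness and soundness together say that has_vertex_cover(n, edges, k) and has_dominating_set(*vc_to_ds(n, edges, k)) agree for every k.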


A corollary of Algorithm 14.10 and the other reductions we have
seen so far is that if DS ∈ P (i.e., dominating set has a polynomial-time
algorithm) then 3SAT ∈ P (i.e., 3SAT has a polynomial-time
algorithm). By the contrapositive, if 3SAT does not have a
polynomial-time algorithm then neither does dominating set.


Figure 14.8: Python implementation of the reduction
from vertex cover to dominating set, together with an
example of an input graph and the resulting output
graph. This reduction allows us to transform a hypothetical
polynomial-time algorithm for dominating set (a
“whistling pig”) into a hypothetical polynomial-time
algorithm for vertex-cover (a “flying horse”).


14.6.2 Anatomy of a reduction 
The reduction of Solved Exercise 14.5 gives a good illustration of the 
anatomy of a reduction. A reduction consists of four parts: 


Figure 14.9: The four components of a reduction,
illustrated for the particular reduction of vertex cover
to dominating set. A reduction from problem F to
problem G is an algorithm that maps an input x for F
into an input R(x) for G. To show that the reduction
is correct we need to show the properties of efficiency:
algorithm R runs in polynomial time, completeness:
if F(x) = 1 then G(R(x)) = 1, and soundness: if
G(R(x)) = 1 then F(x) = 1.


• Algorithm description: This is the description of how the algorithm
maps an input into the output. For example, in Solved Exercise 14.5
this is the description of how we map an instance (G, k) of the
vertex cover problem into an instance (H, k′) of the dominating set
problem.

• Algorithm analysis: It is not enough to describe how the algorithm
works but we need to also explain why it works. In particular we
need to provide an analysis explaining why the reduction is both
efficient (i.e., runs in polynomial time) and correct (satisfies that
G(R(x)) = F(x) for every x). Specifically, the components of
analysis of a reduction R include:

- Efficiency: We need to show that R runs in polynomial time. In
most reductions we encounter this part is straightforward, as the
reductions we typically use involve a constant number of nested
loops, each involving a constant number of operations. For example,
the reduction of Solved Exercise 14.5 just enumerates over
the edges and vertices of the input graph.


- Completeness: In a reduction R demonstrating F ≤_p G, the
completeness condition is the condition that for every x ∈ {0,1}*,
if F(x) = 1 then G(R(x)) = 1. Typically we construct the
reduction to ensure that this holds, by giving a way to map a
“certificate/solution” certifying that F(x) = 1 into a solution
certifying that G(R(x)) = 1. For example, in Solved Exercise 14.5
we constructed the graph H such that for every vertex cover S
in G, the set S ∪ I (where I is the set of isolated vertices) would be a
dominating set in H.

- Soundness: This is the condition that if F(x) = 0 then
G(R(x)) = 0 or (taking the contrapositive) if G(R(x)) = 1 then
F(x) = 1. This is sometimes straightforward but can often be
harder to show than the completeness condition, and in more
advanced reductions (such as the reduction 3SAT ≤_p ISET
of Theorem 14.8) demonstrating soundness is the main part
of the analysis. For example, in Solved Exercise 14.5 to show
soundness we needed to show that for every dominating set D in
the graph H, there exists a vertex cover S of size at most |D| − ℓ
in the graph G (where ℓ is the number of isolated vertices).
This was challenging since the dominating set D might not be
necessarily the one we “had in mind”. In particular, in the proof
above we needed to modify D to ensure that it does not contain
vertices of the form w_{u,v}, and it was important to show that this
modification still maintains the property that D is a dominating
set, and also does not make it bigger.


Whenever you need to provide a reduction, you should make sure 
that your description has all these components. While it is sometimes 
tempting to weave together the description of the reduction and its 
analysis, it is usually clearer if you separate the two, and also break 
down the analysis to its three components of efficiency, completeness, 
and soundness. 


14.7 REDUCING INDEPENDENT SET TO MAXIMUM CUT 


We now show that the independent set problem reduces to the maxi- 
mum cut (or “max cut”) problem, modeled as the function MAXCUT 
that on input a pair (G, k) outputs 1 iff G contains a cut of at least k 
edges. Since both are graph problems, a reduction from independent 
set to max cut maps one graph into the other, but as we will see the 
output graph does not have to have the same vertices or edges as the 
input graph. 


Theorem 14.11 — Hardness of Max Cut. ISET ≤_p MAXCUT


Proof Idea: 

We will map a graph G into a graph H such that a large independent
set in G becomes a partition cutting many edges in H. We can
think of a cut in H as coloring each vertex either “blue” or “red”. We
will add a special “source” vertex s*, connect it to all other vertices,
and assume without loss of generality that it is colored blue. Hence
the more vertices we color red, the more edges from s* we cut. Now,
for every edge {u,v} in the original graph G we will add a special
“gadget” which will be a small subgraph that involves u, v, the source s*,
and two other additional vertices. We design the gadget in a way so
that if the red vertices are not an independent set in G then the
corresponding cut in H will be “penalized” in the sense that it would
not cut as many edges. Once we set for ourselves this objective, it is
not hard to find a gadget that achieves it; see the proof below. Once
again the takeaway technique is to use a gadget (this time a slightly
more clever one).

* 




Proof of Theorem 14.11. We will transform a graph G of n vertices and
m edges into a graph H of n + 1 + 2m vertices and n + 5m edges in the
following way (see also Fig. 14.10). The graph H contains all vertices
of G (though not the edges between them!) and in addition H also
has:

* A special vertex s* that is connected to all the vertices of G.

* For every edge e = {u,v} ∈ E(G), two vertices e_0, e_1 such that e_0
is connected to u and e_1 is connected to v, and moreover we add the
edges {e_0, e_1}, {e_0, s*}, {e_1, s*} to H.

Theorem 14.11 will follow by showing that G contains an inde- 
pendent set of size at least k if and only if H has a cut cutting at least 
k + 4m edges. We now prove both directions of this equivalence: 

Part 1: Completeness. If I is an independent k-sized set in G, then
we can define S to be a cut in H of the following form: we let S
contain all the vertices of I and for every edge e = {u,v} ∈ E(G), if u ∈ I
and v ∉ I then we add e_1 to S; if u ∉ I and v ∈ I then we add e_0 to
S; and if u ∉ I and v ∉ I then we add both e_0 and e_1 to S. (We don’t
need to worry about the case that both u and v are in I since it is an
independent set.) We can verify that in all cases the number of edges
from S to its complement in the gadget corresponding to e will be four
(see Fig. 14.11). Since s* is not in S, we also have k edges from s* to I,
for a total of k + 4m edges.

Part 2: Soundness. Suppose that S is a cut in H that cuts
C ≥ k + 4m edges. We can assume that s* is not in S (otherwise we
can “flip” S to its complement S̄, since this does not change the size
of the cut). Now let I be the set of vertices in S that correspond to the
original vertices of G. If I was an independent set of size k then we
would be done. This might not always be the case but we will see that
if I is not an independent set then it’s also larger than k. Specifically,
we define m_in = |E(I,I)| to be the number of edges in G that are contained
in I and let m_out = m − m_in (i.e., if I is an independent set then
m_in = 0 and m_out = m). By the properties of our gadget we know
that for every edge {u,v} of G, we can cut at most three edges when
both u and v are in S, and at most four edges otherwise. Hence the
number C of edges cut by S satisfies C ≤ |I| + 3m_in + 4m_out =
|I| + 3m_in + 4(m − m_in) = |I| + 4m − m_in. Since C ≥ k + 4m we
get that |I| − m_in ≥ k. Now we can transform I into an independent
set I′ by going over every one of the m_in edges that are inside I and
removing one of the endpoints of the edge from it. The resulting set I′
is an independent set in the graph G of size at least |I| − m_in ≥ k and
so this concludes the proof of the soundness condition.

Figure 14.10: In the reduction of ISET to MAXCUT
we map an n-vertex m-edge graph G into the
n + 2m + 1 vertex and n + 5m edge graph H as
follows. The graph H contains a special “source”
vertex s*, n vertices v_0, ..., v_{n−1}, and 2m vertices
e_0^0, e_0^1, ..., e_{m−1}^0, e_{m−1}^1 with each pair
corresponding to an edge of G. We put an edge between
s* and v_i for every i ∈ [n], and if the t-th
edge of G was (v_i, v_j) then we add the five edges
(s*, e_t^0), (s*, e_t^1), (v_i, e_t^0), (v_j, e_t^1), (e_t^0, e_t^1). The intent
is that if we cut at most one of v_i, v_j from s* then
we’ll be able to cut 4 out of these five edges, while if
we cut both v_i and v_j from s* then we’ll be able to cut
at most three of them.
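The proof implies that the maximum cut of H is exactly 4m plus the size of the largest independent set of G, which can be checked by brute force on tiny graphs. The sketch below is not the book's implementation; the vertex names ("s*" and ("e", t, b)) and the function names are assumptions for illustration, and both optimizers take exponential time.

```python
from itertools import combinations, product

def iset_to_maxcut(n, edges, k):
    """Map (G, k) to (H, k + 4m) as in the proof of Theorem 14.11."""
    s = "s*"
    h_edges = [(s, v) for v in range(n)]          # n edges from the source
    for t, (u, v) in enumerate(edges):
        e0, e1 = ("e", t, 0), ("e", t, 1)
        # the five gadget edges for the t-th edge {u, v} of G
        h_edges += [(s, e0), (s, e1), (u, e0), (v, e1), (e0, e1)]
    gadget = [("e", t, b) for t in range(len(edges)) for b in (0, 1)]
    h_vertices = [s] + list(range(n)) + gadget
    return h_vertices, h_edges, k + 4 * len(edges)

def max_cut(vertices, edges):
    """Brute force over all two-colorings of the vertices."""
    best = 0
    for colors in product([0, 1], repeat=len(vertices)):
        side = dict(zip(vertices, colors))
        best = max(best, sum(side[a] != side[b] for (a, b) in edges))
    return best

def max_independent_set(n, edges):
    """Size of the largest independent set, by brute force."""
    for r in range(n, -1, -1):
        for S in map(set, combinations(range(n), r)):
            if not any(u in S and v in S for (u, v) in edges):
                return r
```

For a triangle, for example, H has 10 vertices and 18 edges, so the brute-force search over colorings is still fast.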




14.8 REDUCING 3SAT TO LONGEST PATH 


Note: This section is still a little messy; feel free to skip it or just read 
it without going into the proof details. The proof appears in Section 
7.5 in Sipser’s book. 

One of the most basic algorithms in Computer Science is Dijkstra’s 
algorithm to find the shortest path between two vertices. We now show 
that in contrast, an efficient algorithm for the longest path problem 
would imply a polynomial-time algorithm for 3SAT. 


Theorem 14.12 — Hardness of longest path. 3SAT ≤_p LONGPATH

Proof Idea:

To prove Theorem 14.12 we need to show how to transform a 3CNF
formula φ into a graph G and two vertices s, t such that G has a path
of length at least k if and only if φ is satisfiable. The idea of the
reduction is sketched in Fig. 14.13 and Fig. 14.14. We will construct a graph
that contains a potentially long “snaking path” that corresponds to
all variables in the formula. We will add a “gadget” corresponding
to each clause of φ in a way that we would only be able to use the
gadgets if we have a satisfying assignment.

Figure 14.11: In the reduction of independent set
to max cut, for every t ∈ [m], we have a “gadget”
corresponding to the t-th edge [...] color the intermediate
vertices e_t^0, e_t^1, we will cut at most three edges from
the gadget. The figure above contains only the gadget edges
and ignores the edges connecting s* to the vertices
v_0, ..., v_{n−1}. (Panel labels: one or both of v_i, v_j on the
same side as s*: 4 edges cut; both v_i, v_j cut from s*:
≤ 3 edges cut.)

Figure 14.12: [...]

Figure 14.13: We can transform a 3SAT formula φ into
a graph G such that the longest path in the graph G
would correspond to a satisfying assignment in φ. In
this graph, the black colored part corresponds to the
variables of φ and the blue colored part corresponds
to the clauses. A sufficiently long path would have to
first “snake” through the black part, for each variable
choosing either the “upper path” (corresponding
to assigning it the value True) or the “lower path”
(corresponding to assigning it the value False). Then
to achieve maximum length the path would traverse
through the blue part, where to go between two
vertices corresponding to a clause such as x_17 ∨ ¬x_32 ∨
x_57, the corresponding vertices would have to have
been not traversed before.


*

Figure 14.14: The graph above with the longest path
marked on it. The part of the path corresponding to
variables is in green and the part corresponding to the
clauses is in pink.

# numvars, getclauses and Graph are helpers that this listing assumes
# are defined elsewhere; they are not shown in this section.
def TSAT2LONGPATH(φ):
    """Reduce 3SAT to LONGPATH"""
    def var(v):
        # return variable index and True/False depending
        # on whether the literal is positive or negated
        return (int(v[2:]), False) if v[0] == "¬" else (int(v[1:]), True)
    n = numvars(φ)
    clauses = getclauses(φ)
    m = len(clauses)
    G = Graph()
    G.edge("start", "start_0")
    for i in range(n):  # add 2 length-m paths per variable
        G.edge(f"start_{i}", f"v_{i}_{0}_T")
        G.edge(f"start_{i}", f"v_{i}_{0}_F")
        for j in range(m-1):
            G.edge(f"v_{i}_{j}_T", f"v_{i}_{j+1}_T")
            G.edge(f"v_{i}_{j}_F", f"v_{i}_{j+1}_F")
        G.edge(f"v_{i}_{m-1}_T", f"end_{i}")
        G.edge(f"v_{i}_{m-1}_F", f"end_{i}")
        if i < n-1:
            G.edge(f"end_{i}", f"start_{i+1}")
    G.edge(f"end_{n-1}", "start_clauses")
    for j, C in enumerate(clauses):  # add gadget for each clause
        for literal in C:
            i, sign = var(literal)
            # a positive literal uses the path left free when x_i = True
            s = "F" if sign else "T"
            G.edge(f"C_{j}_in", f"v_{i}_{j}_{s}")
            G.edge(f"v_{i}_{j}_{s}", f"C_{j}_out")
        if j < m-1:
            G.edge(f"C_{j}_out", f"C_{j+1}_in")
    G.edge("start_clauses", "C_0_in")
    G.edge(f"C_{m-1}_out", "end")
    return G, 1+n*(m+1)+1+2*m+1


Proof of Theorem 14.12. We build a graph G that “snakes” from s to t as
follows. After s we add a sequence of n long loops. Each loop has an
“upper path” and a “lower path”. A simple path cannot take both the
upper path and the lower path, and so it will need to take exactly one
of them to reach t from s.

Our intention is that a path in the graph will correspond to an
assignment x ∈ {0,1}^n in the sense that taking the upper path in the i-th
loop corresponds to assigning x_i = 1 and taking the lower path
corresponds to assigning x_i = 0. When we are done snaking through all
the n loops corresponding to the variables, to reach t we need to pass
through m “obstacles”: for each clause j we will have a small gadget
consisting of a pair of vertices s_j, t_j that have three paths between
them. For example, if the j-th clause had the form x_17 ∨ ¬x_55 ∨ x_72 then
one path would go through a vertex in the lower loop corresponding
to x_17, one path would go through a vertex in the upper loop
corresponding to x_55 and the third would go through the lower loop
corresponding to x_72. We see that if we went in the first stage according
to a satisfying assignment then we will be able to find a free vertex to
travel from s_j to t_j. We link t_1 to s_2, t_2 to s_3, etc. and link t_m to t. Thus
a satisfying assignment would correspond to a path from s to t that
goes through one path in each loop corresponding to the variables,
and one path in each loop corresponding to the clauses. We can make
the loop corresponding to the variables long enough so that we must
take the entire path in each loop in order to have a fighting chance of
getting a path as long as the one corresponding to a satisfying
assignment. But if we do that, then the only way we are able to reach t is
if the paths we took corresponded to a satisfying assignment, since
otherwise we will have one clause j where we cannot reach t_j from s_j
without using a vertex we already used before.


14.8.1 Summary of relations 

We have shown that there are a number of functions F for which we
can prove a statement of the form “If F ∈ P then 3SAT ∈ P”. Hence
coming up with a polynomial-time algorithm for even one of these
problems will entail a polynomial-time algorithm for 3SAT (see for
example Fig. 14.16). In Chapter 15 we will show the inverse direction
(“If 3SAT ∈ P then F ∈ P”) for these functions, hence allowing us to
conclude that they have equivalent complexity to 3SAT.
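The statement “If F ∈ P then 3SAT ∈ P” is obtained by composing a reduction with a solver. As a small illustration using the reduction from independent set to vertex cover of Solved Exercise 14.3, the sketch below wraps a vertex cover solver into an independent set solver. The function names and the instance format (n, edges, k) are assumptions, and brute_force_vc is only an exponential-time stand-in for the hypothetical efficient algorithm.

```python
from itertools import combinations

def solver_via_reduction(reduction, solve_target):
    """Turn a solver for G into a solver for F, given a reduction from F to G."""
    return lambda instance: solve_target(reduction(instance))

def iset_to_vc(instance):
    """ISET to VC on instances (n, edges, k): keep the graph, send k to n - k."""
    n, edges, k = instance
    return (n, edges, n - k)

def brute_force_vc(instance):
    """Stand-in for a hypothetical fast VC algorithm (this one is exponential)."""
    n, edges, k = instance
    if k < 0:
        return False
    return any(all(u in S or v in S for (u, v) in edges)
               for r in range(min(k, n) + 1)
               for S in map(set, combinations(range(n), r)))

# an independent set solver obtained purely by composition
solve_iset = solver_via_reduction(iset_to_vc, brute_force_vc)
```

If brute_force_vc were replaced by a polynomial-time procedure, solve_iset would run in polynomial time as well, since the reduction itself is efficient.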


Chapter Recap

• The computational complexity of many seemingly
unrelated computational problems can be related
to one another through the use of reductions.

• If F ≤_p G then a polynomial-time algorithm
for G can be transformed into a polynomial-time
algorithm for F.

• Equivalently, if F ≤_p G and F does not have a
polynomial-time algorithm then neither does G.

• We’ve developed many techniques to show that
3SAT ≤_p F for interesting functions F. Sometimes
we can do so by using transitivity of reductions: if
3SAT ≤_p G and G ≤_p F then 3SAT ≤_p F.

Figure 14.15: The result of applying the reduction of
3SAT to LONGPATH to the formula (x_0 ∨ ¬x_3 ∨ x_2) ∧
(¬x_0 ∨ x_1 ∨ ¬x_2) ∧ (x_1 ∨ x_2 ∨ ¬x_3).


14.9 EXERCISES 
14.10 BIBLIOGRAPHICAL NOTES 


Several notions of reductions are defined in the literature. The notion 
defined in Definition 14.1 is often known as a mapping reduction, many 
to one reduction or a Karp reduction. 

The maximal (as opposed to maximum) independent set problem is the task
of finding a “local maximum” of an independent set: an independent
set S such that one cannot add a vertex to it without losing the
independence property (such a set is always also a dominating set). Unlike
finding a maximum independent set, finding a maximal independent
set can be done efficiently by a greedy algorithm, but this local
maximum can be much smaller than the global maximum.
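One such greedy algorithm can be sketched as follows (a minimal illustration; the function name and the graph representation with vertices 0, ..., n−1 and edges as pairs are assumptions):

```python
def greedy_maximal_independent_set(n, edges):
    """Scan the vertices in order, keeping each vertex that has
    no neighbor among the vertices kept so far."""
    nbrs = {v: set() for v in range(n)}
    for (u, v) in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    S = set()
    for v in range(n):
        if not (nbrs[v] & S):
            S.add(v)
    return S
```

On a star whose center is vertex 0 and whose leaves are 1, ..., 4, this scan keeps only the center, a maximal independent set of size 1, while the four leaves form an independent set of size 4.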

Reduction of independent set to max cut taken from these notes. 
Image of Hamiltonian Path through Dodecahedron by Christoph 
Sommer. 

We have mentioned that the line between reductions used for
algorithm design and showing hardness is sometimes blurry. An excellent
example for this is the area of SAT Solvers (see [Gom+08]). In this
field people use algorithms for SAT (that take exponential time in the
worst case but often are much faster on many instances in practice)
together with reductions of the form F ≤_p SAT to derive algorithms
for other functions F of interest.
Figure 14.16: So far we have shown that P ⊆ EXP and
that several problems we care about such as 3SAT and
MAXCUT are in EXP but it is not known whether or
not they are in P. However, since 3SAT ≤_p MAXCUT
we can rule out the possibility that MAXCUT ∈ P but
3SAT ∉ P. The relation of P/poly to the class EXP is
not known. We know that EXP does not contain P/poly
since the latter even contains uncomputable functions,
but we do not know whether or not EXP ⊆ P/poly
(though it is believed that this is not the case and in
particular that both 3SAT and MAXCUT are not in
P/poly).


Learning Objectives: 


• Introduce the class NP capturing a great many
important computational problems

• NP-completeness: evidence that a problem
might be intractable.

• The P vs NP problem.


15 
NP, NP completeness, and the Cook-Levin Theorem 


“In this paper we give theorems that suggest, but do not imply, that these 
problems, as well as many others, will remain intractable perpetually”, Richard 
Karp, 1972 


“Sad to say, but it will be many more years, if ever before we really understand 
the Mystical Power of Twoness... 2-SAT is easy, 3-SAT is hard, 2-dimensional 
matching is easy, 3-dimensional matching is hard. Why? oh, Why?” Eugene 
Lawler 


So far we have shown that 3SAT is no harder than Quadratic
Equations, Independent Set, Maximum Cut, and Longest Path. But to show
that these problems are computationally equivalent we need to give
reductions in the other direction, reducing each one of these problems to
3SAT as well. It turns out we can reduce all of these problems to 3SAT in
one fell swoop.

In fact, this result extends far beyond these particular problems. All
of the problems we discussed in Chapter 14, and a great many other
problems, share the same commonality: they are all search problems,
where the goal is to decide, given an instance x, whether there exists
a solution y that satisfies some condition that can be verified in
polynomial time. For example, in 3SAT, the instance is a formula and the
solution is an assignment to the variables; in Max-Cut the instance is a
graph and the solution is a cut in the graph; and so on and so forth. It
turns out that every such search problem can be reduced to 3SAT.


Compiled on 12.19.2022 22:58 
[Figure 15.1: diagram with sample 3NAND and 3SAT formulas.]


Figure 15.1: Overview of the results of this chapter.
We define NP to contain all decision problems for
which a solution can be efficiently verified. The main
result of this chapter is the Cook-Levin Theorem
(Theorem 15.6) which states that 3SAT has a polynomial-time
algorithm if and only if every problem in NP
has a polynomial-time algorithm. Another way to
state this theorem is that 3SAT is NP complete. We
will prove the Cook-Levin theorem by defining the
two intermediate problems NANDSAT and 3NAND,
proving that NANDSAT is NP complete, and then
proving that $NANDSAT \leq_p 3NAND \leq_p 3SAT$.
15.1 THE CLASS NP


To make the above precise, we will make the following mathematical 
definition. We define the class NP to contain all Boolean functions that 
correspond to a search problem of the form above. That is, a Boolean 
function F is in NP if F has the form that on input a string x, F(x) = 1 
if and only if there exists a “solution” string w such that the pair (x, w) 
satisfies some polynomial-time checkable condition. Formally, NP is 
defined as follows: 


Definition 15.1 — NP. We say that $F : \{0,1\}^* \rightarrow \{0,1\}$ is in NP if there
exists some integer $a > 0$ and $V : \{0,1\}^* \rightarrow \{0,1\}$ such that $V \in P$
and for every $x \in \{0,1\}^n$,

$$F(x) = 1 \Leftrightarrow \exists_{w \in \{0,1\}^{n^a}} \text{ s.t. } V(xw) = 1 \;. \qquad (15.1)$$


In other words, for $F$ to be in NP, there needs to exist some
polynomial-time computable verification function $V$, such that if
$F(x) = 1$ then there must exist $w$ (of length polynomial in $|x|$) such
that $V(xw) = 1$, and if $F(x) = 0$ then for every such $w$, $V(xw) = 0$.

Figure 15.2: The class NP corresponds to problems
where solutions can be efficiently verified. That is, this
is the class of functions $F$ such that $F(x) = 1$ if there
is a “solution” $w$ of length polynomial in $|x|$ that can
be verified by a polynomial-time algorithm $V$.


Since the existence of this string w certifies that F(x) = 1, w is often 
referred to as a certificate, witness, or proof that F(x) = 1. 

See also Fig. 15.2 for an illustration of Definition 15.1. The name
NP stands for “non-deterministic polynomial time” and is used for
historical reasons; see the bibliographical notes. The string $w$ in (15.1)
is sometimes known as a solution, certificate, or witness for the instance
$x$.


Solved Exercise 15.1 — Alternative definition of NP. Show that the condition
that $|w| = |x|^a$ in Definition 15.1 can be replaced by the condition
that $|w| \leq p(|x|)$ for some polynomial $p$. That is, prove that for every
$F : \{0,1\}^* \rightarrow \{0,1\}$, $F \in NP$ if and only if there is a polynomial-time
Turing machine $V$ and a polynomial $p : \mathbb{N} \rightarrow \mathbb{N}$ such that for
every $x \in \{0,1\}^*$, $F(x) = 1$ if and only if there exists $w \in \{0,1\}^*$ with
$|w| \leq p(|x|)$ such that $V(x,w) = 1$.


Solution: 

The “only if” direction (namely that if $F \in NP$ then there is an
algorithm $V$ and a polynomial $p$ as above) follows immediately
from Definition 15.1 by letting $p(n) = n^a$. For the “if” direction,
the idea is that if a string $w$ is of size at most $p(n)$ for a degree
$d$ polynomial $p$, then there is some $n_0$ such that for all $n > n_0$,
$|w| < n^{d+1}$. Hence we can encode $w$ by a string of exactly length
$n^{d+1}$ by padding it with 1 and an appropriate number of zeroes.
Hence if there is an algorithm $V$ and polynomial $p$ as above, then
we can define an algorithm $V'$ that does the following on input
$x, w'$ with $|x| = n$ and $|w'| = n^a$, where $a = d+1$:


• If $n \leq n_0$ then $V'(x,w')$ ignores $w'$ and enumerates over all $w$
of length at most $p(n)$ and outputs 1 if there exists $w$ such that
$V(x,w) = 1$. (Since $n \leq n_0$, this only takes a constant number of
steps.)


• If $n > n_0$ then $V'(x,w')$ “strips out” the padding by dropping
all the rightmost zeroes from $w'$ until it reaches the first 1
(which it drops as well) and obtains a string $w$. If $|w| \leq p(n)$
then $V'$ outputs $V(x,w)$, and otherwise it outputs 0.


Since $V$ runs in polynomial time, $V'$ runs in polynomial time
as well, and by definition for every $x$, there exists $w' \in \{0,1\}^{|x|^a}$
such that $V'(xw') = 1$ if and only if there exists $w \in \{0,1\}^*$ with
$|w| \leq p(|x|)$ such that $V(xw) = 1$.
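
The padding argument can be sketched in Python. This is only an illustration under stated assumptions: the names `pad`, `unpad` and `padded_verifier` are hypothetical, strings of `'0'`/`'1'` characters stand in for bit strings, `V` takes the pair `(x, w)`, and the caller supplies `p`, `d` and `n0` with $p(n) < n^{d+1}$ for all $n > n_0$.

```python
from itertools import product

def pad(w: str, length: int) -> str:
    """Encode w as exactly `length` bits: w, then a single 1, then zeroes."""
    if len(w) >= length:
        raise ValueError("w must be strictly shorter than the padded length")
    return w + "1" + "0" * (length - len(w) - 1)

def unpad(w_padded: str) -> str:
    """Drop the rightmost zeroes and then the first 1 that is reached."""
    return w_padded.rstrip("0")[:-1]

def padded_verifier(V, p, d, n0):
    """Build V' from V, assuming |w| <= p(n) < n**(d+1) whenever n > n0."""
    def V_prime(x: str, w_padded: str) -> int:
        n = len(x)
        if n <= n0:
            # Ignore w_padded and try every witness of length at most p(n).
            short = ("".join(bits) for k in range(p(n) + 1)
                     for bits in product("01", repeat=k))
            return int(any(V(x, w) == 1 for w in short))
        if len(w_padded) != n ** (d + 1) or "1" not in w_padded:
            return 0  # not a well-formed padded witness
        w = unpad(w_padded)
        return V(x, w) if len(w) <= p(n) else 0
    return V_prime
```

For instance, with a toy `V` that accepts only the witness `"10"`, the padded verifier accepts `pad("10", 16)` on a length-4 input and rejects every other padded string.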


The definition of NP means that for every $F \in NP$ and string
$x \in \{0,1\}^*$, $F(x) = 1$ if and only if there is a short and efficiently
verifiable proof of this fact. That is, we can think of the function $V$ in
Definition 15.1 as a verifier algorithm, similar to what we’ve seen in
Section 11.1. The verifier checks whether a given string $w \in \{0,1\}^*$ is a
valid proof for the statement “$F(x) = 1$”. Essentially all proof systems
considered in mathematics involve line-by-line checks that can be carried
out in polynomial time. Thus the heart of NP is asking for statements
that have short (i.e., polynomial in the size of the statements)
proofs. Indeed, as we will see in Chapter 16, Kurt Gödel phrased the
question of whether NP = P as asking whether “the mental work of
a mathematician [in proving theorems] could be completely replaced
by a machine”.
15.1.1 Examples of functions in NP
We now present some examples of functions that are in the class NP.
We start with the canonical example of the 3SAT function.


Example 15.3 — $3SAT \in NP$. 3SAT is in NP since for every $\ell$-variable
formula $\varphi$, $3SAT(\varphi) = 1$ if and only if there exists a
satisfying assignment $x \in \{0,1\}^\ell$ such that $\varphi(x) = 1$, and we
can check this condition in polynomial time.

The above reasoning explains why 3SAT is in NP, but since this 
is our first example, we will now belabor the point and expand out 
in full formality the precise representation of the witness w and the 
algorithm V that demonstrate that 3SAT is in NP. Since demon- 
strating that functions are in NP is fairly straightforward, in future 
cases we will not use as much detail, and the reader can also feel 
free to skip the rest of this example. 

Using Solved Exercise 15.1, it is OK if the witness is of size at most
polynomial in the input length $n$, rather than of precisely size $n^a$
for some integer $a > 0$. Specifically, we can represent a 3CNF
formula $\varphi$ with $k$ variables and $m$ clauses as a string of length
$n = O(m \log k)$, since every one of the $m$ clauses involves three
variables and their negation, and the identity of each variable can
be represented using $\lceil \log_2 k \rceil$ bits. We assume that every variable
participates in some clause (as otherwise it can be ignored) and hence
that $k \leq 3m$, which in particular means that $m$ and $k$ are both at
most linear in the input length $n$.

We can represent an assignment to the $k$ variables using a
$k$-length string $w$. The following algorithm checks whether a given $w$
satisfies the formula $\varphi$:

Algorithm 15.4 takes $O(m)$ time to enumerate over all clauses,
and will return 1 if and only if $w$ satisfies all the clauses.
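
A verifier of this kind can be sketched in a few lines of Python. The representation is an assumption made for illustration and is not taken from the text: a formula is a list of clauses, each clause is a list of three `(variable_index, negated)` pairs, and the witness `w` is a list of `k` bits. The function name `verify_3sat` is hypothetical.

```python
def verify_3sat(k, clauses, w):
    """Return 1 iff the k-bit assignment w satisfies every clause.

    A literal (i, negated) is satisfied when w[i] is 1 and the literal is
    positive, or w[i] is 0 and the literal is negated.
    """
    if len(w) != k:
        return 0  # reject witnesses of the wrong length
    for clause in clauses:
        if not any(bool(w[i]) != negated for (i, negated) in clause):
            return 0  # this clause has no satisfied literal
    return 1
```

The loop visits each of the $m$ clauses once and inspects three literals per clause, which matches the $O(m)$ bound stated above.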


Here are some more examples for problems in NP. For each one 
of these problems we merely sketch how the witness is represented 
and why it is efficiently checkable, but working out the details can be a 
good way to get more comfortable with Definition 15.1: 


• QUADEQ is in NP since for every $\ell$-variable instance of quadratic
equations $E$, $QUADEQ(E) = 1$ if and only if there exists an assignment
$x \in \{0,1\}^\ell$ that satisfies $E$. We can check the condition that
$x$ satisfies $E$ in polynomial time by enumerating over all the equations
in $E$, and for each such equation $e$, plugging in the values of $x$ and
verifying that $e$ is satisfied.


• ISET is in NP since for every graph $G$ and integer $k$, $ISET(G,k) = 1$
if and only if there exists a set $S$ of $k$ vertices that contains no
pair of neighbors in $G$. We can check the condition that $S$ is an
independent set of size $\geq k$ in polynomial time by first checking
that $|S| \geq k$ and then enumerating over all edges $\{u,v\}$ in $G$, and
for each such edge verifying that either $u \notin S$ or $v \notin S$.


• LONGPATH is in NP since for every graph $G$ and integer $k$,
$LONGPATH(G,k) = 1$ if and only if there exists a simple path $P$
in $G$ that is of length at least $k$. We can check the condition that $P$
is a simple path of length $k$ in polynomial time by checking that it
has the form $(v_0, v_1, \ldots, v_k)$ where each $v_i$ is a vertex in $G$, no $v_i$ is
repeated, and for every $i \in [k]$, the edge $\{v_i, v_{i+1}\}$ is present in the
graph.


• MAXCUT is in NP since for every graph $G$ and integer $k$,
$MAXCUT(G,k) = 1$ if and only if there exists a cut $(S, \overline{S})$ in $G$ that
cuts at least $k$ edges. We can check the condition that $(S, \overline{S})$ is a
cut of value at least $k$ in polynomial time by checking that $S$ is a
subset of $G$’s vertices and enumerating over all the edges $\{u,v\}$ of
$G$, counting those edges such that $u \in S$ and $v \notin S$ or vice versa.


15.1.2 Basic facts about NP 

The definition of NP is one of the most important definitions of this
book, and it is well worth taking the time to digest and internalize. The
following solved exercises establish some basic properties of this class.
As usual, I highly recommend that you try to work out the solutions
yourself.


Solved Exercise 15.2 — Verifying is no harder than solving. Prove that $P \subseteq NP$.


Solution:

Suppose that $F \in P$. Define the following function $V$: $V(x0^n) = 1$
iff $n = |x|$ and $F(x) = 1$. ($V$ outputs 0 on all other inputs.) Since
$F \in P$ we can clearly compute $V$ in polynomial time as well.

Let $x \in \{0,1\}^n$ be some string. If $F(x) = 1$ then $V(x0^n) = 1$. On
the other hand, if $F(x) = 0$ then for every $w \in \{0,1\}^n$, $V(xw) = 0$.
Therefore, setting $a = 1$ (i.e. $w \in \{0,1\}^{n^1}$), we see that $V$ satisfies
(15.1), and establishes that $F \in NP$.


Solved Exercise 15.3 — NP is in exponential time. Prove that $NP \subseteq EXP$.


Solution:

Suppose that $F \in NP$ and let $V$ be the polynomial-time computable
function that satisfies (15.1) and $a$ the corresponding
constant. Then given every $x \in \{0,1\}^n$, we can check whether
$F(x) = 1$ in time $poly(n) \cdot 2^{n^a} = o(2^{n^{a+1}})$ by enumerating over
all the $2^{n^a}$ strings $w \in \{0,1\}^{n^a}$ and checking whether $V(xw) = 1$,
in which case we return 1. If $V(xw) = 0$ for every such $w$ then we
return 0. By construction, the algorithm above will run in time at
most exponential in its input length and by the definition of NP it
will return $F(x)$ for every $x$.
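
The exhaustive search in this solution can be sketched directly. As in Definition 15.1, the verifier `V` here takes the concatenation `xw` as a single string of `'0'`/`'1'` characters; the function name `solve_by_enumeration` is a hypothetical label for this sketch.

```python
from itertools import product

def solve_by_enumeration(V, x, a):
    """Compute F(x) by trying all 2**(n**a) candidate witnesses.

    V is assumed to be a verifier for F in the sense of (15.1), taking the
    concatenation x + w as its input.
    """
    n = len(x)
    for bits in product("01", repeat=n ** a):
        if V(x + "".join(bits)) == 1:
            return 1  # found a witness certifying F(x) = 1
    return 0  # no witness of length n**a exists
```

The loop body is one polynomial-time call to `V`, and the loop runs $2^{n^a}$ times, which is where the exponential bound comes from.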


Solved Exercise 15.2 and Solved Exercise 15.3 together imply that

$$P \subseteq NP \subseteq EXP \;.$$

The time hierarchy theorem (Theorem 13.9) implies that $P \subsetneq EXP$
and hence at least one of the two inclusions $P \subseteq NP$ or $NP \subseteq EXP$
is strict. It is believed that both of them are in fact strict inclusions.
That is, it is believed that there are functions in NP that cannot be
computed in polynomial time (this is the $P \neq NP$ conjecture) and
that there are functions $F$ in EXP for which we cannot even efficiently
certify that $F(x) = 1$ for a given input $x$. One function $F$
that is believed to lie in $EXP \setminus NP$ is the function $\overline{3SAT}$ defined as
$\overline{3SAT}(\varphi) = 1 - 3SAT(\varphi)$ for every 3CNF formula $\varphi$. The conjecture
that $\overline{3SAT} \notin NP$ is known as the “$NP \neq co\text{-}NP$” conjecture. It
implies the $P \neq NP$ conjecture (see Exercise 15.2).

We have previously informally equated the notion of $F \leq_p G$ with
$F$ being “no harder than $G$” and in particular have seen in Solved
Exercise 14.1 that if $G \in P$ and $F \leq_p G$, then $F \in P$ as well. The
following exercise shows that if $F \leq_p G$ then it is also “no harder to
verify” than $G$. That is, regardless of whether or not it is in P, if $G$ has
the property that solutions to it can be efficiently verified, then so does
$F$.


Solved Exercise 15.4 — Reductions and NP. Let $F,G : \{0,1\}^* \rightarrow \{0,1\}$.
Show that if $F \leq_p G$ and $G \in NP$ then $F \in NP$.


Solution:

Suppose that $G$ is in NP and in particular there exists $a$ and $V \in P$
such that for every $y \in \{0,1\}^*$, $G(y) = 1 \Leftrightarrow \exists_{w \in \{0,1\}^{|y|^a}} V(yw) = 1$.
Suppose also that $F \leq_p G$ and so in particular there is an $n^b$-time
computable function $R$ such that $F(x) = G(R(x))$ for all
$x \in \{0,1\}^*$. Define $V'$ to be a Turing machine that on input a pair
$(x,w)$ computes $y = R(x)$ and returns 1 if and only if $|w| = |y|^a$
and $V(yw) = 1$. Then $V'$ runs in polynomial time, and for every
$x \in \{0,1\}^*$, $F(x) = 1$ iff there exists $w$ of size $|R(x)|^a$, which is at
most polynomial in $|x|$, such that $V'(x,w) = 1$, hence demonstrating
that $F \in NP$.
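
The construction of $V'$ from $R$ and $V$ is short enough to sketch in Python. The name `verifier_via_reduction` is hypothetical, `R` maps strings to strings, and `V` takes the concatenation `yw` as in (15.1).

```python
def verifier_via_reduction(R, V, a):
    """Turn a verifier V for G and a reduction R from F to G into a verifier for F."""
    def V_prime(x, w):
        y = R(x)                    # map the F-instance to a G-instance
        if len(w) != len(y) ** a:   # the witness must have the length V expects
            return 0
        return V(y + w)             # accept exactly when w certifies G(y) = 1
    return V_prime
```

Since `R` runs in polynomial time, `len(y)` is polynomial in `len(x)`, so both the witness length and the running time of `V_prime` stay polynomial in the length of `x`.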
15.2 FROM NP TO 3SAT: THE COOK-LEVIN THEOREM


We have seen several examples of problems for which we do not know 
if their best algorithm is polynomial or exponential, but we can show 
that they are in NP. That is, we don’t know if they are easy to solve, but 
we do know that it is easy to verify a given solution. There are many, 
many, many, more examples of interesting functions we would like to 
compute that are easily shown to be in NP. What is quite amazing is 
that if we can solve 3SAT then we can solve all of them! 

The following is one of the most fundamental theorems in Com- 
puter Science: 


Theorem 15.6 — Cook-Levin Theorem. For every $F \in NP$, $F \leq_p 3SAT$.


We will soon show the proof of Theorem 15.6, but note that it
immediately implies that QUADEQ, LONGPATH, and MAXCUT all
reduce to 3SAT. Combining it with the reductions we’ve seen in Chapter 14,
it implies that all these problems are equivalent! For example,
to reduce QUADEQ to LONGPATH, we can first reduce QUADEQ to
3SAT using Theorem 15.6 and use the reduction we’ve seen in
Theorem 14.12 from 3SAT to LONGPATH. That is, since $QUADEQ \in NP$,
Theorem 15.6 implies that $QUADEQ \leq_p 3SAT$, and Theorem 14.12
implies that $3SAT \leq_p LONGPATH$, which by the transitivity of
reductions (Solved Exercise 14.2) means that $QUADEQ \leq_p LONGPATH$.
Similarly, since $LONGPATH \in NP$, we can use Theorem 15.6 and
Theorem 14.4 to show that $LONGPATH \leq_p 3SAT \leq_p QUADEQ$,
concluding that LONGPATH and QUADEQ are computationally
equivalent.

There is of course nothing special about QUADEQ and LONGPATH
here: by combining Theorem 15.6 with the reductions we saw, we see that just
like 3SAT, every $F \in NP$ reduces to LONGPATH, and the same is true
for QUADEQ and MAXCUT. All these problems are in some sense
“the hardest in NP” since an efficient algorithm for any one of them
would imply an efficient algorithm for all the problems in NP. This
motivates the following definition:


Definition 15.7 — NP-hardness and NP-completeness. Let $G : \{0,1\}^* \rightarrow \{0,1\}$.
We say that $G$ is NP hard if for every $F \in NP$, $F \leq_p G$.
We say that $G$ is NP complete if $G$ is NP hard and $G \in NP$.


The Cook-Levin Theorem (Theorem 15.6) can be rephrased as 
saying that 3SAT is NP hard, and since it is also in NP, this means that 
3SAT is NP complete. Together with the reductions of Chapter 14, 
Theorem 15.6 shows that despite their superficial differences, 3SAT, 
quadratic equations, longest path, independent set, and maximum 
cut, are all NP-complete. Many thousands of additional problems
have been shown to be NP-complete, arising from all the sciences, 
mathematics, economics, engineering and many other fields. (For a 
few examples, see this Wikipedia page and this website.) 


15.2.1 What does this mean? 

As we’ve seen in Solved Exercise 15.2, $P \subseteq NP$. The most famous
conjecture in Computer Science is that this containment is strict. That is,
it is widely conjectured that $P \neq NP$. One way to refute the conjecture
that $P \neq NP$ is to give a polynomial-time algorithm for even a
single one of the NP-complete problems such as 3SAT, Max Cut, or
the thousands of others that have been studied in all fields of human
endeavors. The fact that these problems have been studied by so many
people, and yet not a single polynomial-time algorithm for any of
them has been found, supports the conjecture that indeed $P \neq NP$. In
fact, for many of these problems (including all the ones we mentioned
above), we don’t even know of a $2^{o(n)}$-time algorithm! However, to the
frustration of computer scientists, we have not yet been able to prove
that $P \neq NP$ or even rule out the existence of an $O(n)$-time algorithm
for 3SAT. Resolving whether or not $P = NP$ is known as the P vs NP
problem. A million-dollar prize has been offered for the solution of
this problem, a popular book has been written, and every year a new
paper comes out claiming a proof of $P = NP$ or $P \neq NP$, only to wither
under scrutiny.

One of the mysteries of computation is that people have observed a
certain empirical “zero-one law” or “dichotomy” in the computational
complexity of natural problems, in the sense that many natural problems
are either in P (often in $TIME(O(n))$ or $TIME(O(n^2))$), or they
are NP hard. This is related to the fact that for most natural problems,
the best known algorithm is either exponential or polynomial,
with not too many examples where the best running time is some
strange intermediate complexity such as $2^{2^{\sqrt{\log n}}}$. However, it is
believed that there exist problems in NP that are neither in P nor are
NP-complete, and in fact a result known as “Ladner’s Theorem” shows
that if $P \neq NP$ then this is indeed the case (see also Exercise 15.1 and
Fig. 15.3).




Figure 15.3: The world if $P \neq NP$ (left) and $P = NP$
(right). In the former case the set of NP-complete
problems is disjoint from P and Ladner’s theorem
shows that there exist problems that are neither in
P nor are NP-complete. (There are remarkably few
natural candidates for such problems, with some
prominent examples being decision variants of
problems such as integer factoring, lattice shortest
vector, and finding Nash equilibria.) In the latter case
that $P = NP$ the notion of NP-completeness loses its
meaning, as essentially all functions in P (save for the
trivial constant zero and constant one functions) are
NP-complete.


Figure 15.4: A rough illustration of the (conjectured)
status of problems in exponential time. Darker colors
correspond to higher running time, and the circle in
the middle is the problems in P. NP is a (conjectured
to be proper) superclass of P and the NP-complete
problems (or NPC for short) are the “hardest” problems
in NP, in the sense that a solution for one of
them implies a solution for all other problems in NP.
It is conjectured that all the NP-complete problems
require at least $\exp(n^{\epsilon})$ time to solve for a constant
$\epsilon > 0$, and many require $\exp(\Omega(n))$ time. The permanent
is not believed to be contained in NP though
it is NP-hard, which means that a polynomial-time
algorithm for it implies that $P = NP$.
15.2.2 The Cook-Levin Theorem: Proof outline
We will now prove the Cook-Levin Theorem, which is the underpinning
to a great web of reductions from 3SAT to thousands of problems
across many great fields. Some problems that have been shown to be
NP-complete include: minimum-energy protein folding, minimum
surface-area foam configuration, map coloring, optimal Nash
equilibrium, quantum state entanglement, minimum supersequence of
a genome, minimum codeword problem, shortest vector in a lattice,
minimum genus knots, positive Diophantine equations, integer
programming, and many many more. The worst-case complexity of all
these problems is (up to polynomial factors) equivalent to that of
3SAT, and through the Cook-Levin Theorem, to all problems in NP.

To prove Theorem 15.6 we need to show that $F \leq_p 3SAT$ for every
$F \in NP$. We will do so in three stages. We define two intermediate
problems: NANDSAT and 3NAND. We will shortly show the
definitions of these two problems, but Theorem 15.6 will follow from
combining the following three results:


1. NANDSAT is NP hard (Lemma 15.8).

2. $NANDSAT \leq_p 3NAND$ (Lemma 15.9).

3. $3NAND \leq_p 3SAT$ (Lemma 15.10).


By the transitivity of reductions, it will follow that for every $F \in NP$,

$$F \leq_p NANDSAT \leq_p 3NAND \leq_p 3SAT$$

hence establishing Theorem 15.6.

We will prove these three results Lemma 15.8, Lemma 15.9 and 
Lemma 15.10 one by one, providing the requisite definitions as we go 
along. 


15.3 THE NANDSAT PROBLEM, AND WHY IT IS NP HARD 
The function $NANDSAT : \{0,1\}^* \rightarrow \{0,1\}$ is defined as follows:


• The input to NANDSAT is a string $Q$ representing a NAND-CIRC
program (or equivalently, a circuit with NAND gates).


• The output of NANDSAT on input $Q$ is 1 if and only if there exists a
string $w \in \{0,1\}^n$ (where $n$ is the number of inputs to $Q$) such that
$Q(w) = 1$.


Solved Exercise 15.5 — $NANDSAT \in NP$. Prove that $NANDSAT \in NP$.

Solution:

We have seen that the circuit (or straightline program) evaluation
problem can be computed in polynomial time. Specifically,
given a NAND-CIRC program $Q$ of $s$ lines and $n$ inputs, and
$w \in \{0,1\}^n$, we can evaluate $Q$ on the input $w$ in time which is
polynomial in $s$ and hence verify whether or not $Q(w) = 1$.
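
The verifier for NANDSAT is just a program evaluator. Below is a minimal sketch, assuming the program is given as text with one line of the form `foo = NAND(bar,blah)` per gate, inputs named `X[i]`, a single output `Y[0]`, and variables that were never written to defaulting to 0. The function name `eval_nand_circ` and the regular expression are illustrative choices, not the book's code.

```python
import re

# One program line: target = NAND(arg1,arg2), with optional whitespace.
LINE = re.compile(r"^\s*(\S+)\s*=\s*NAND\(\s*([^,\s]+)\s*,\s*([^)\s]+)\s*\)\s*$")

def eval_nand_circ(program: str, w: str) -> int:
    """Evaluate a NAND-CIRC program on the input bits w and return Y[0]."""
    env = {f"X[{i}]": int(b) for i, b in enumerate(w)}
    for line in program.splitlines():
        if not line.strip():
            continue
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"cannot parse line: {line!r}")
        foo, bar, blah = match.groups()
        # NAND(a, b) = 1 - a*b; unwritten variables are read as 0.
        env[foo] = 1 - env.get(bar, 0) * env.get(blah, 0)
    return env.get("Y[0]", 0)
```

Each line is processed once with constant work, so checking a candidate witness `w` takes time linear in the number of lines $s$.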


We now prove that NANDSAT is NP hard. 
Lemma 15.8 NANDSAT is NP hard. 


Proof Idea: 

The proof closely follows the proof that $P \subseteq P_{/poly}$ (Theorem 13.12,
see also Section 13.6.2). Specifically, if $F \in NP$ then there is a
polynomial time Turing machine $M$ and positive integer $a$ such that for
every $x \in \{0,1\}^n$, $F(x) = 1$ iff there is some $w \in \{0,1\}^{n^a}$ such that
$M(xw) = 1$. The proof that $P \subseteq P_{/poly}$ gave us a way (via “unrolling
the loop”) to come up in polynomial time with a Boolean circuit $C$ on
$n^a$ inputs that computes the function $w \mapsto M(xw)$. We can then translate
$C$ into an equivalent NAND circuit (or NAND-CIRC program) $Q$.
We see that there is a string $w \in \{0,1\}^{n^a}$ such that $Q(w) = 1$ if and
only if there is such $w$ satisfying $M(xw) = 1$ which (by definition)
happens if and only if $F(x) = 1$. Hence the translation of $x$ into the
circuit $Q$ is a reduction showing $F \leq_p NANDSAT$.

* 


Proof of Lemma 15.8. Let $F \in NP$. To prove Lemma 15.8 we need to
give a polynomial-time computable function that will map every $x^* \in
\{0,1\}^*$ to a NAND-CIRC program $Q$ such that $F(x^*) = NANDSAT(Q)$.

Let $x^* \in \{0,1\}^*$ be such a string and let $n = |x^*|$ be its length. By
Definition 15.1 there exists $V \in P$ and positive $a \in \mathbb{N}$ such that $F(x^*) = 1$
if and only if there exists $w \in \{0,1\}^{n^a}$ satisfying $V(x^*w) = 1$.

Let $m = n^a$. Since $V \in P$ there is some NAND-TM program $P^*$ that
computes $V$ on inputs of the form $xw$ with $x \in \{0,1\}^n$ and $w \in \{0,1\}^m$
in at most $(n+m)^c$ time for some constant $c$. Using our “unrolling
the loop NAND-TM to NAND compiler” of Theorem 13.14, we can
obtain a NAND-CIRC program $Q'$ that has $n+m$ inputs and at most
$O((n+m)^{2c})$ lines such that $Q'(xw) = P^*(xw)$ for every $x \in \{0,1\}^n$
and $w \in \{0,1\}^m$.

We can then use a simple “hardwiring” technique, reminiscent of
Remark 9.11, to map $Q'$ into a circuit/NAND-CIRC program $Q$ on $m$
inputs such that $Q(w) = Q'(x^*w)$ for every $w \in \{0,1\}^m$.

CLAIM: There is a polynomial-time algorithm that on input a
NAND-CIRC program $Q'$ on $n+m$ inputs and $x^* \in \{0,1\}^n$, outputs
a NAND-CIRC program $Q$ such that for every $w \in \{0,1\}^m$, $Q(w) =
Q'(x^*w)$.

PROOF OF CLAIM: We can do so by adding a few lines to ensure
that the variables zero and one are 0 and 1 respectively, and then
simply replacing any reference in $Q'$ to an input $x_i$ with $i \in [n]$ by the
corresponding value based on $x^*_i$. See Fig. 15.5 for an implementation
of this reduction in Python.

Our final reduction maps an input $x^*$ into the NAND-CIRC program
$Q$ obtained above. By the above discussion, this reduction runs
in polynomial time. Since we know that $F(x^*) = 1$ if and only if there
exists $w \in \{0,1\}^m$ such that $P^*(x^*w) = 1$, this means that $F(x^*) = 1$ if
and only if $NANDSAT(Q) = 1$, which is what we wanted to prove.


[Figure 15.5 listing: Python code for the hardwiring transformation, shown alongside an example of its execution on a simple program.]

Figure 15.5: Given a $T$-line NAND-CIRC program
$Q$ that has $n+m$ inputs and some $x^* \in \{0,1\}^n$,
we can transform $Q$ into a $T+3$ line NAND-CIRC
program $Q'$ that computes the map $w \mapsto Q(x^*w)$
for $w \in \{0,1\}^m$ by simply adding code to compute
the zero and one constants, replacing all references to
X[i] with either zero or one depending on the value
of $x^*_i$, and then replacing the remaining references
to X[j] with X[j − n]. Above is Python code that
implements this transformation, as well as an example
of its execution on a simple program.
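
The hardwiring step can be sketched as follows. This is a reconstruction under assumptions rather than the figure's code: programs are text with lines `foo = NAND(bar,blah)`, the names `temp`, `one` and `zero` are assumed not to be used by the input program, at least one input remains after hardwiring (so the prefix may read `X[0]`), and `hardwire` and `run` are hypothetical names.

```python
import re

# Three extra lines that compute the constants one and zero from X[0].
CONST_PREFIX = ("temp = NAND(X[0],X[0])\n"
                "one = NAND(X[0],temp)\n"
                "zero = NAND(one,one)\n")

def hardwire(program: str, x_star: str) -> str:
    """Fix the first len(x_star) inputs of `program` to the bits of x_star."""
    n = len(x_star)
    def substitute(match):
        i = int(match.group(1))
        if i < n:  # a hardwired input becomes a constant
            return "one" if x_star[i] == "1" else "zero"
        return f"X[{i - n}]"  # a remaining input is shifted down by n
    return CONST_PREFIX + re.sub(r"X\[(\d+)\]", substitute, program)

def run(program: str, w: str) -> int:
    """Tiny evaluator used only to check the transformation."""
    env = {f"X[{i}]": int(b) for i, b in enumerate(w)}
    for line in program.splitlines():
        if line.strip():
            target, args = line.split("=", 1)
            bar, blah = args.strip()[len("NAND("):-1].split(",")
            env[target.strip()] = 1 - env.get(bar.strip(), 0) * env.get(blah.strip(), 0)
    return env.get("Y[0]", 0)
```

A single regular-expression pass handles both the replacement by constants and the renumbering, so an index such as `X[10]` is never confused with `X[1]`.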


15.4 THE 3NAND PROBLEM 
The 3NAND problem is defined as follows:


• The input is a logical formula $\Psi$ on a set of variables $z_0, \ldots, z_{r-1}$
which is an AND of constraints of the form $z_i = NAND(z_j, z_k)$.

• The output is 1 if and only if there is an input $z \in \{0,1\}^r$ that
satisfies all of the constraints.


For example, the following is a 3NAND formula with 5 variables
and 3 constraints:

$$\Psi = (z_3 = NAND(z_0, z_2)) \wedge (z_1 = NAND(z_0, z_2)) \wedge (z_4 = NAND(z_3, z_1)) \;.$$

In this case $3NAND(\Psi) = 1$ since the assignment $z = 01010$ satisfies
it. Given a 3NAND formula $\Psi$ on $r$ variables and an assignment $z \in
\{0,1\}^r$, we can check in polynomial time whether $\Psi(z) = 1$, and hence
$3NAND \in NP$. We now prove that 3NAND is NP hard:
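
Checking an assignment against a 3NAND formula takes one pass over the constraints. In the sketch below a constraint $z_i = NAND(z_j, z_k)$ is encoded as the triple `(i, j, k)`; this encoding and the name `eval_3nand` are assumptions for illustration. The example reads the three-constraint formula above as $(z_3 = NAND(z_0,z_2)) \wedge (z_1 = NAND(z_0,z_2)) \wedge (z_4 = NAND(z_3,z_1))$.

```python
def eval_3nand(constraints, z):
    """Return True iff every constraint z[i] = NAND(z[j], z[k]) holds."""
    return all(z[i] == 1 - z[j] * z[k] for (i, j, k) in constraints)

# The example formula with 5 variables and 3 constraints.
psi = [(3, 0, 2), (1, 0, 2), (4, 3, 1)]
```

Calling `eval_3nand(psi, [0, 1, 0, 1, 0])` confirms that the assignment 01010 satisfies the formula, while the all-zero assignment does not.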


Lemma 15.9 $NANDSAT \leq_p 3NAND$.


Proof Idea: 

To prove Lemma 15.9 we need to give a polynomial-time map from
every NAND-CIRC program $Q$ to a 3NAND formula $\Psi$ such that there
exists $w$ such that $Q(w) = 1$ if and only if there exists $z$ satisfying $\Psi$.
For every line $i$ of $Q$, we define a corresponding variable $z_i$ of $\Psi$. If
the line $i$ has the form foo = NAND(bar,blah) then we will add the
clause $z_i = NAND(z_j, z_k)$ where $j$ and $k$ are the last lines in which bar
and blah were written to. We will also set variables corresponding
to the input variables, as well as add a clause to ensure that the final
output is 1. The resulting reduction can be implemented in about a
dozen lines of Python, see Fig. 15.6.

* 


print(xor5)

    u = NAND(X[0],X[1])
    v = NAND(X[0],u)
    w = NAND(X[1],u)
    s = NAND(v,w)
    u = NAND(s,X[2])
    v = NAND(s,u)
    w = NAND(X[2],u)
    s = NAND(v,w)
    u = NAND(s,X[3])
    v = NAND(s,u)
    w = NAND(X[3],u)
    s = NAND(v,w)
    u = NAND(s,X[4])
    v = NAND(s,u)
    w = NAND(X[4],u)
    Y[0] = NAND(v,w)

def NANDSAT23NAND_(Q):
    Q = CONSTPREFIX + Q
    n, _ = numinout(Q)
    Ψ = ''
    # varidx[u] is n+line where u is last written to
    varidx = defaultdict(lambda : n+2)  # default: the line in which zero is set
    for i in range(n): varidx[f'X[{i}]'] = i
    j = n
    for line in Q.split('\n'):
        if not line.strip(): continue
        foo,bar,blah = splitline(line)  # split "foo = NAND(bar,blah)"
        Ψ += f"(z{j} = NAND(z{varidx[bar]},z{varidx[blah]}) ) ∧ "
        varidx[foo] = j
        j += 1
    Ψ += f"(z{varidx['Y[0]']} = NAND(z{varidx['zero']},z{varidx['zero']}) )"
    return Ψ

print(NANDSAT23NAND_(xor5))

    (z5 = NAND(z0,z0) ) ∧ (z6 = NAND(z0,z5) ) ∧ (z7 = NAND(z6,z6) ) ∧ (z8 = NAND(z0,z1) ) ∧
    (z9 = NAND(z0,z8) ) ∧ (z10 = NAND(z1,z8) ) ∧ (z11 = NAND(z9,z10) ) ∧ (z12 = NAND(z11,z2) ) ∧
    (z13 = NAND(z11,z12) ) ∧ (z14 = NAND(z2,z12) ) ∧ (z15 = NAND(z13,z14) ) ∧ (z16 = NAND(z15,z3) ) ∧
    (z17 = NAND(z15,z16) ) ∧ (z18 = NAND(z3,z16) ) ∧ (z19 = NAND(z17,z18) ) ∧ (z20 = NAND(z19,z4) ) ∧
    (z21 = NAND(z19,z20) ) ∧ (z22 = NAND(z4,z20) ) ∧ (z23 = NAND(z21,z22) ) ∧ (z23 = NAND(z7,z7) )

eval3NAND(NANDSAT23NAND_(xor5),[1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

    True

Figure 15.6: Python code to reduce an instance $Q$ of
NANDSAT to an instance $\Psi$ of 3NAND. In the example
above we transform the NAND-CIRC program
xor5 which has 5 input variables and 16 lines, into
a 3NAND formula $\Psi$ that has 24 variables and 20
clauses. Since xor5 outputs 1 on the input 1, 0, 0, 1, 1,
there exists an assignment $z \in \{0,1\}^{24}$ to $\Psi$ such that
$(z_0, z_1, z_2, z_3, z_4) = (1, 0, 0, 1, 1)$ and $\Psi$ evaluates to
true on $z$.
Proof of Lemma 15.9. To prove Lemma 15.9 we need to give a reduction
from NANDSAT to 3NAND. Let $Q$ be a NAND-CIRC program with
$n$ inputs, one output, and $m$ lines. We can assume without loss of
generality that $Q$ contains the variables one and zero as usual.

We map $Q$ to a 3NAND formula $\Psi$ as follows:


• $\Psi$ has $m+n$ variables $z_0, \ldots, z_{m+n-1}$.


• The first $n$ variables $z_0, \ldots, z_{n-1}$ will correspond to the inputs of $Q$.
The next $m$ variables $z_n, \ldots, z_{n+m-1}$ will correspond to the $m$ lines
of $Q$.


• For every $\ell \in \{n, n+1, \ldots, n+m-1\}$, if the $(\ell-n)$-th line of the program
$Q$ is foo = NAND(bar,blah) then we add to $\Psi$ the constraint $z_\ell =
NAND(z_j, z_k)$ where $j-n$ and $k-n$ correspond to the last lines
in which the variables bar and blah (respectively) were written to.
If one or both of bar and blah was not written to before then we
use $z_{\ell_0}$ instead of the corresponding value $z_j$ or $z_k$ in the constraint,
where $\ell_0 - n$ is the line in which zero is assigned a value. If one or
both of bar and blah is an input variable X[i] then we use $z_i$ in the
constraint.


• Let $\ell^*$ be the last line in which the output Y[0] is assigned a value.
Then we add the constraint $z_{\ell^*} = NAND(z_{\ell_0}, z_{\ell_0})$ where $\ell_0 - n$ is as
above the last line in which zero is assigned a value. Note that this
is effectively the constraint $z_{\ell^*} = NAND(0,0) = 1$.


To complete the proof we need to show that there exists $w \in \{0,1\}^n$
s.t. $Q(w) = 1$ if and only if there exists $z \in \{0,1\}^{n+m}$ that satisfies all
constraints in $\Psi$. We now show both sides of this equivalence.

Part I: Completeness. Suppose that there is $w \in \{0,1\}^n$ s.t. $Q(w) = 1$.
Let $z \in \{0,1\}^{n+m}$ be defined as follows: for $i \in [n]$, $z_i = w_i$ and
for $i \in \{n, n+1, \ldots, n+m-1\}$, $z_i$ equals the value that is assigned in
the $(i-n)$-th line of $Q$ when executed on $w$. Then by construction
$z$ satisfies all of the constraints of $\Psi$ (including the constraint that
$z_{\ell^*} = NAND(0,0) = 1$ since $Q(w) = 1$).

Part II: Soundness. Suppose that there exists $z \in \{0,1\}^{n+m}$ satisfying
$\Psi$. Soundness will follow by showing that $Q(z_0, \ldots, z_{n-1}) = 1$ (and
hence in particular there exists $w \in \{0,1\}^n$, namely $w = z_0 \cdots z_{n-1}$,
such that $Q(w) = 1$). To do this we will prove the following claim
$(*)$: for every $\ell \in [m]$, $z_{\ell+n}$ equals the value assigned in the $\ell$-th step
of the execution of the program $Q$ on $z_0, \ldots, z_{n-1}$. Note that because $z$
satisfies the constraints of $\Psi$, $(*)$ is sufficient to prove the soundness
condition since these constraints imply that the last value assigned to
the variable Y[0] in the execution of $Q$ on $z_0 \cdots z_{n-1}$ is equal to 1. To
prove $(*)$ suppose, towards a contradiction, that it is false, and let $\ell$ be
the smallest number such that $z_{\ell+n}$ is not equal to the value assigned
in the $\ell$-th step of the execution of $Q$ on $z_0, \ldots, z_{n-1}$. But since $z$
satisfies the constraints of $\Psi$, we get that $z_{\ell+n} = NAND(z_i, z_j)$ where
(by the assumption above that $\ell$ is smallest with this property) these
values do correspond to the values last assigned to the variables on the
right-hand side of the assignment operator in the $\ell$-th line of the
program. But this means that the value assigned in the $\ell$-th step is indeed
simply the NAND of $z_i$ and $z_j$, contradicting our assumption on the
choice of $\ell$.


15.5 FROM 3NAND TO 3SAT 
The final step in the proof of Theorem 15.6 is the following: 
Lemma 15.10 3NAND ≤_p 3SAT.


Proof Idea: 

To prove Lemma 15.10 we need to map a 3NAND formula φ into
a 3SAT formula ψ such that ψ is satisfiable if and only if φ is. The
idea is that we can transform every NAND constraint of the form
a = NAND(b, c) into the AND of ORs involving the variables a, b, c
and their negations, where each of the ORs contains at most three
terms. The construction is fairly straightforward, and the details are
given below.


*


# Reduce 3NAND to 3SAT
# Input: 3NAND formula Ψ
# Output: 3CNF formula φ
#   s.t. φ satisfiable iff Ψ is
# getnandclauses is a helper that is not shown in this figure; it is
# assumed to yield one triple (a,b,c) per constraint a = NAND(b,c) of Ψ.
def NAND23SAT_(Ψ):
    φ = ""
    for (a,b,c) in getnandclauses(Ψ):
        φ += f'(¬{a} ∨ ¬{b} ∨ ¬{c}) ∧ ({a} ∨ {b} ∨ {b}) ∧ ({a} ∨ {c} ∨ {c}) ∧ '
    return φ[:-3] # chop off redundant ∧


Ψ = "(x0 = NAND(x2,x3) ) ∧ (x3 = NAND(x2,x1) ) ∧ (x1 = NAND(x2,x3) ) "
NAND23SAT_(Ψ)


"(¬x0 ∨ ¬x2 ∨ ¬x3) ∧ (x0 ∨ x2 ∨ x2) ∧ (x0 ∨ x3 ∨ x3) ∧ (¬x3 ∨ ¬x2 ∨ ¬x1) ∧ (x3 ∨ x2 ∨ x2) ∧ (x3 ∨ x1 ∨ x1) ∧ (¬x1 ∨ ¬x2 ∨ ¬x3) ∧ (x1 ∨ x2 ∨ x2) ∧ (x1 ∨ x3 ∨ x3)"


Figure 15.7: A 3NAND instance that is obtained by 
taking a NAND-IM program for computing the 
AND function, unrolling it to obtain a NANDSAT 
instance, and then composing it with the reduction of 
Lemma 15.9. 


Figure 15.8: Code and example output for the reduction
given in Lemma 15.10 of 3NAND to 3SAT.

Proof of Lemma 15.10. The constraint

z_i = NAND(z_j, z_k) (15.2)

is satisfied if and only if z_i = 0 when (z_j, z_k) = (1,1) and z_i = 1
otherwise. By going through all cases, we can verify that (15.2) is
equivalent to the constraint

(¬z_i ∨ ¬z_j ∨ ¬z_k) ∧ (z_i ∨ z_j) ∧ (z_i ∨ z_k) . (15.3)

Indeed if z_j = z_k = 1 then the first constraint of Eq. (15.3) is only
true if z_i = 0. On the other hand, if either of z_j or z_k equals 0 then
unless z_i = 1 either the second or third constraints will fail. This means
that, given any 3NAND formula φ over n variables z_0, ..., z_{n−1}, we can
obtain a 3SAT formula ψ over the same variables by replacing every
3NAND constraint of φ with three 3OR constraints as in Eq. (15.3).¹
Because of the equivalence of (15.2) and (15.3), the formula ψ
satisfies that ψ(z_0, ..., z_{n−1}) = φ(z_0, ..., z_{n−1}) for every assignment
z_0, ..., z_{n−1} ∈ {0,1}^n to the variables. In particular ψ is satisfiable if
and only if φ is, thus completing the proof.
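
The "going through all cases" step can be checked mechanically. The following is a minimal sketch that enumerates the eight assignments to (z_i, z_j, z_k) and compares constraint (15.2) with the three clauses of (15.3); the function names are illustrative and do not come from the text.

```python
from itertools import product

def nand(b, c):
    """NAND of two bits."""
    return 1 - (b & c)

def nand_constraint(a, b, c):
    """Constraint (15.2): a = NAND(b, c)."""
    return a == nand(b, c)

def three_or_clauses(a, b, c):
    """Constraint (15.3): (not a or not b or not c) and (a or b) and (a or c)."""
    return bool(((1 - a) | (1 - b) | (1 - c)) and (a | b) and (a | c))

def constraints_agree():
    """True if (15.2) and (15.3) agree on all eight assignments."""
    return all(nand_constraint(a, b, c) == three_or_clauses(a, b, c)
               for a, b, c in product((0, 1), repeat=3))
```

Calling constraints_agree() performs exactly the case analysis used in the proof.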




SAT2IS (NAND23SAT (NANDSAT23NAND(xor ) ) ) 


15.6 WRAPPING UP 


We have shown that for every function F in NP, F ≤_p NANDSAT ≤_p
3NAND ≤_p 3SAT, and so 3SAT is NP-hard. Since in Chapter 14 we
saw that 3SAT ≤_p QUADEQ, 3SAT ≤_p ISET, 3SAT ≤_p MAXCUT
and 3SAT ≤_p LONGPATH, all these problems are NP-hard as well.
Finally, since all the aforementioned problems are in NP, they are
all in fact NP-complete and have equivalent complexity. There are
thousands of other natural problems that are NP-complete as well.
Finding a polynomial-time algorithm for any one of them will imply a
polynomial-time algorithm for all of them.


¹ The resulting formula will have some of the OR’s
involving only two variables. If we wanted to insist on
each formula involving three distinct variables we can
always add a “dummy variable” z_{n+m} and include it
in all the OR’s involving only two variables, and add a
constraint requiring this dummy variable to be zero.


Figure 15.9: An instance of the independent set problem
obtained by applying the reductions NANDSAT ≤_p
3NAND ≤_p 3SAT ≤_p ISET starting with the xor5
NAND-CIRC program.

Figure 15.10 (panels: Likely, Possible, Impossible): We believe that
P ≠ NP and all NP-complete problems lie outside of P, but we cannot
rule out the possibility that P = NP. However, we
can rule out the possibility that some NP-complete
problems are in P and others are not, since we know
that if even one NP-complete problem is in P then
P = NP. The relation between P_{/poly} and NP is
not known though it can be shown that if one NP-complete
problem is in P_{/poly} then NP ⊆ P_{/poly}.


15.7 EXERCISES 


Exercise 15.1 — Poor man’s Ladner’s Theorem. Prove that if there is no
n^{O(log^2 n)} time algorithm for 3SAT then there is some F ∈ NP such
that F ∉ P and F is not NP complete.²

² Hint: Use the function F that on input a formula φ
and a string of the form 1^t, outputs 1 if and only if φ
is satisfiable and t = |φ|^{log |φ|}.

Exercise 15.2 — NP ≠ co-NP ⇒ NP ≠ P. Let \overline{3SAT} be the function
that on input a 3CNF formula φ returns 1 − 3SAT(φ). Prove that if
\overline{3SAT} ∉ NP then P ≠ NP. See footnote for hint.³

³ Hint: Prove and then use the fact that P is closed
under complement.


Exercise 15.3 Define WSAT to be the following function: the input is a
CNF formula φ where each clause is the OR of one to three variables
(without negations), and a number k ∈ ℕ. For example, the following
formula can be used for a valid input to WSAT: φ = (x_5 ∨ x_2 ∨ x_1) ∧
(x_1 ∨ x_3 ∨ x_0) ∧ (x_2 ∨ x_4 ∨ x_0). The output WSAT(φ, k) = 1 if and
only if there exists a satisfying assignment to φ in which exactly k
of the variables get the value 1. For example for the formula above
WSAT(φ, 2) = 1 since the assignment (1, 1, 0, 0, 0, 0) satisfies all the
clauses. However WSAT(φ, 1) = 0 since there is no single variable
appearing in all clauses.

Prove that WSAT is NP-complete. 


Exercise 15.4 In the employee recruiting problem we are given a list of 
potential employees, each of which has some subset of m potential 
skills, and a number k. We need to assemble a team of k employees 
such that for every skill there would be one member of the team with 
this skill. 

For example, if Alice has the skills “C programming”, “NAND 
programming” and “Solving Differential Equations”, Bob has the 
skills “C programming” and “Solving Differential Equations”, and 
Charlie has the skills “NAND programming” and “Coffee Brewing”, 
then if we want a team of two people that covers all the four skills, we 
would hire Alice and Charlie. 

Define the function EMP s.t. on input the skills L of all potential
employees (in the form of a sequence L of n lists L_1, ..., L_n, each
containing distinct numbers between 0 and m), and a number k,
EMP(L, k) = 1 if and only if there is a subset S of k potential
employees such that for every skill j in [m], there is an employee in S that
has the skill j.


Prove that EMP is NP complete. 


Exercise 15.5 — Balanced max cut. Prove that the “balanced variant” of
the maximum cut problem is NP-complete, where this is defined as
BMC : {0,1}* → {0,1} where for every graph G = (V, E) and k ∈ ℕ,
BMC(G, k) = 1 if and only if there exists a cut S in G cutting at least k
edges such that |S| = |V|/2.


Exercise 15.6 — Regular expression intersection. Let MANYREGS be the
following function: On input a list of regular expressions exp_0, ..., exp_{m−1}
(represented as strings in some standard way), output 1 if and only if
there is a single string x ∈ {0,1}* that matches all of them. Prove that
MANYREGS is NP-hard.


15.8 BIBLIOGRAPHICAL NOTES 


Aaronson’s 120 page survey [Aar16] is a beautiful and extensive
exposition of the P vs NP problem, its importance and status. See also
Chapter 3 in Wigderson’s excellent book [Wig19]. Johnson
[Joh12] gives a survey of the historical development of the theory of
NP completeness. The following web page keeps a catalog of failed
attempts at settling P vs NP. At the time of this writing, it lists about
110 papers claiming to resolve the question, of which about 60 claim to
prove that P = NP and about 50 claim to prove that P ≠ NP.

Ladner’s Theorem was proved by Richard Ladner in 1975. Ladner,
who was born to deaf parents, later switched his research focus
to computing for assistive technologies, where he has made many
contributions. In 2014, he wrote a personal essay on his path from
theoretical CS to accessibility research.

Eugene Lawler’s quote on the “mystical power of twoness” was 
taken from the wonderful book “The Nature of Computation” by 
Moore and Mertens. See also this memorial essay on Lawler by 
Lenstra. 


16 
What if P equals NP? 


“You don’t have to believe in God, but you should believe in The Book.”, Paul
Erdős, 1985.¹


“No more half measures, Walter”, Mike Ehrmantraut in “Breaking Bad”, 
2010. 


“The evidence in favor of [P ≠ NP] and [its algebraic counterpart] is so
overwhelming, and the consequences of their failure are so grotesque, that their 
status may perhaps be compared to that of physical laws rather than that of 
ordinary mathematical conjectures.”, Volker Strassen, laudation for Leslie 
Valiant, 1986. 


“Suppose aliens invade the earth and threaten to obliterate it in a year’s time 
unless human beings can find the [fifth Ramsey number]. We could marshal 
the world’s best minds and fastest computers, and within a year we could prob- 
ably calculate the value. If the aliens demanded the [sixth Ramsey number], 
however, we would have no choice but to launch a preemptive attack.”, Paul
Erdős, as quoted by Graham and Spencer, 1990.²


We have mentioned that the question of whether P = NP, which 
is equivalent to whether there is a polynomial-time algorithm for 
3SAT, is the great open question of Computer Science. But why is it so 
important? In this chapter, we will try to figure out the implications of 
such an algorithm. 

First, let us get one qualm out of the way. Sometimes people say,
“What if P = NP but the best algorithm for 3SAT takes n^{1000} time?” Well,
n^{1000} is much larger than, say, 2^{0.001√n} for any input smaller than 2^{50},
which is as large as any hard drive you will encounter, and so another way to
phrase this question is to say “what if the complexity of 3SAT is
exponential for all inputs that we will ever encounter, but then grows
much smaller than that?” To me this sounds like the computer science
equivalent of asking, “what if the laws of physics change completely
once they are out of the range of our telescopes?”. Sure, this is a valid
possibility, but wondering about it does not sound like the most
productive use of our time.
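
The comparison between n^{1000} and 2^{0.001√n} is easiest to see after taking base-2 logarithms of both quantities. The sketch below does this; the function names and default parameters are illustrative choices rather than anything fixed by the text.

```python
import math

def log2_poly(n, degree=1000):
    """log2 of n**degree."""
    return degree * math.log2(n)

def log2_subexp(n, c=0.001):
    """log2 of 2**(c * sqrt(n))."""
    return c * math.sqrt(n)

def poly_is_larger(n):
    """True if n**1000 exceeds 2**(0.001 * sqrt(n)) for this input length n."""
    return log2_poly(n) > log2_subexp(n)
```

For n = 2**50 the polynomial has the larger exponent (50000 against roughly 33554), while for n = 2**60 the order has already flipped.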


Compiled on 12.19.2022 22:58 


Learning Objectives: 


Explore the consequences of P = NP 


Search-to-decision reduction: transform 
algorithms that solve decision version to 
search version for NP-complete problems. 


Optimization and learning problems 


Quantifier elimination and solving problems 
in the polynomial hierarchy. 


What is the evidence for P = NP vs P ≠ NP?


1 Paul Erdős (1913-1996) was one of the most prolific 
mathematicians of all times. Though he was an athe- 
ist, Erdős often referred to “The Book” in which God 
keeps the most elegant proof of each mathematical 
theorem. 


² The k-th Ramsey number, denoted as R(k, k), is the
smallest number n such that for every graph G on n
vertices, either G or its complement contains a k-sized
independent set. If P = NP then we can compute
R(k, k) in time polynomial in 2^k, while otherwise it
can potentially take closer to 2^{2^{2k}} steps.

So, as the saying goes, we’ll keep an open mind, but not so open
that our brains fall out, and assume from now on that:


• There is a mathematical god,
and
• She does not “beat around the bush” or take “half measures”.


What we mean by this is that we will consider two extreme scenar- 
ios: 


• 3SAT is very easy: 3SAT has an O(n) or O(n^2) time algorithm with
a not too huge constant (say smaller than 10^6).


• 3SAT is very hard: 3SAT is exponentially hard and cannot be
solved faster than 2^{εn} for some not too tiny ε > 0 (say at least
10^{−6}). We can even make the stronger assumption that for every
sufficiently large n, the restriction of 3SAT to inputs of length n
cannot be computed by a circuit of fewer than 2^{εn} gates.


At the time of writing, the fastest known algorithm for 3SAT
requires more than 2^{0.35n} steps to solve n-variable formulas, while we do not
even know how to rule out the possibility that we can compute 3SAT
using 10n gates. To put it in perspective, for the case n = 1000 our
lower and upper bounds for the computational costs are apart by
a factor of about 10^{100}. As far as we know, it could be the case that
1000-variable 3SAT can be solved in a millisecond on a first-generation
iPhone, and it can also be the case that such instances require more
than the age of the universe to solve on the world’s fastest
supercomputer.
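
The size of this gap for n = 1000 follows from a short calculation. As a rough sketch (the function name and parameters are illustrative), one can compare the base-10 logarithms of the 2^{0.35n} upper bound and the 10n lower bound:

```python
import math

def log10_gap(n, exponent_rate=0.35, gates_per_variable=10):
    """Orders of magnitude between 2**(exponent_rate * n) and gates_per_variable * n."""
    log10_upper = exponent_rate * n * math.log10(2)
    log10_lower = math.log10(gates_per_variable * n)
    return log10_upper - log10_lower
```

For n = 1000 this is about 105.4 − 4 ≈ 101 orders of magnitude, in line with the "about 10^{100}" figure.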

So far, most of our evidence points to the latter possibility of 3SAT
being exponentially hard, but we have not ruled out the former
possibility either. In this chapter we will explore some of the consequences
of the “3SAT easy” scenario.


16.1 SEARCH-TO-DECISION REDUCTION


A priori, having a fast algorithm for 3SAT might not seem so impres- 
sive. Sure, such an algorithm allows us to decide the satisfiability of 
not just 3CNF formulas but also of quadratic equations, as well as find 
out whether there is a long path in a graph, and solve many other de- 
cision problems. But this is not typically what we want to do. It’s not 
enough to know if a formula is satisfiable: we want to discover the 
actual satisfying assignment. Similarly, it’s not enough to find out if a 
graph has a long path: we want to actually find the path. 

It turns out that if we can solve these decision problems, we can 
solve the corresponding search problems as well: 


Theorem 16.1 — Search vs Decision. Suppose that P = NP. Then
for every polynomial-time algorithm V and a, b ∈ ℕ, there is a
polynomial-time algorithm FIND_V such that for every x ∈ {0,1}^n,
if there exists y ∈ {0,1}^{an^b} satisfying V(xy) = 1, then FIND_V(x)
finds some string y′ satisfying this condition.


Proof Idea: 

The idea behind the proof of Theorem 16.1 is simple; let us
demonstrate it for the special case of 3SAT. (In fact, this case is not
so “special”: since 3SAT is NP-complete, we can reduce the task of
solving the search problem for MAXCUT or any other problem in
NP to the task of solving it for 3SAT.) Suppose that P = NP and we
are given a satisfiable 3CNF formula φ, and we now want to find a
satisfying assignment y for φ. Define 3SAT_0(φ) to output 1 if there is
a satisfying assignment y for φ such that its first bit is 0, and similarly
define 3SAT_1(φ) = 1 if there is a satisfying assignment y with y_0 = 1.
The key observation is that both 3SAT_0 and 3SAT_1 are in NP, and so if
P = NP then we can compute them in polynomial time as well. Thus
we can use this to find the first bit of the satisfying assignment. We
can continue in this way to recover all the bits.

* 


Proof of Theorem 16.1. Let V be some polynomial time algorithm and
a, b ∈ ℕ some constants. Define the function STARTSWITH_V as
follows: For every x ∈ {0,1}* and z ∈ {0,1}*, STARTSWITH_V(x, z) =
1 if and only if there exists some y ∈ {0,1}^{an^b − |z|} (where n = |x|) such
that V(xzy) = 1. That is, STARTSWITH_V(x, z) outputs 1 if there is
some string w of length a|x|^b such that V(xw) = 1 and the first |z|
bits of w are z_0, ..., z_{ℓ−1}. Since, given x, y, z as above, we can check in
polynomial time if V(xzy) = 1, the function STARTSWITH_V is in NP
and hence if P = NP we can compute it in polynomial time.

Now for every such polynomial-time V and a, b ∈ ℕ, we can
implement FIND_V(x) as follows:


To analyze Algorithm 16.2, note that it makes 2an^b invocations to
STARTSWITH_V and hence if the latter is polynomial-time, then so is
Algorithm 16.2. Now suppose that x is such that there exists some y
satisfying V(xy) = 1. We claim that at every step ℓ = 0, ..., an^b − 1, we
maintain the invariant that there exists y ∈ {0,1}^{an^b} whose first ℓ bits
are z s.t. V(xy) = 1. Note that this claim implies the theorem, since in
particular it means that after the final step ℓ = an^b − 1, z satisfies V(xz) = 1.

We prove the claim by induction. For ℓ = 0, the condition on the first ℓ
bits is vacuous, and so the invariant holds by our assumption on x.
Now for every ℓ > 0, if the call STARTSWITH_V(x z_0 ⋯ z_{ℓ−1} 0)
returns 1, then we are guaranteed the invariant by definition of
STARTSWITH_V. Now under our inductive hypothesis, there is
y_ℓ, ..., y_{an^b−1} such that V(x z_0 ⋯ z_{ℓ−1} y_ℓ ⋯ y_{an^b−1}) = 1. If the call to
STARTSWITH_V(x z_0 ⋯ z_{ℓ−1} 0) returns 0 then it must be the case that
y_ℓ = 1, and hence when we set z_ℓ = 1 we maintain the invariant.
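
The bit-by-bit procedure analyzed above can be sketched in Python as follows. This is only an illustration under stated assumptions: make_startswith is a brute-force, exponential-time stand-in for the polynomial-time STARTSWITH_V algorithm that would exist if P = NP; the parameter m plays the role of an^b; the verifier takes x and the candidate string as two arguments; and cnf_verifier is a hypothetical toy verifier used only to exercise the code. Unlike the procedure in the text, this variant makes one oracle call per bit after an initial feasibility check.

```python
from itertools import product

def make_startswith(V, m):
    """Brute-force stand-in for STARTSWITH_V: is there w in {0,1}^m extending z with V(x, w)?"""
    def startswith(x, z):
        rest = m - len(z)
        return any(V(x, z + "".join(bits)) for bits in product("01", repeat=rest))
    return startswith

def find(V, x, m):
    """Return some w in {0,1}^m with V(x, w) true, or None if no such w exists."""
    startswith = make_startswith(V, m)
    if not startswith(x, ""):
        return None
    z = ""
    for _ in range(m):
        # Invariant: some solution starts with z. Extend z by 0 if a solution
        # still exists with that prefix, and by 1 otherwise.
        z += "0" if startswith(x, z + "0") else "1"
    return z

def cnf_verifier(clauses):
    """Toy verifier: literal k > 0 means variable k-1 is true, k < 0 means it is false."""
    def V(x, w):
        return all(any((w[abs(l) - 1] == "1") == (l > 0) for l in clause)
                   for clause in clauses)
    return V
```

For instance, find(cnf_verifier([(1, -2, 3), (-1, 2), (-3,), (1,)]), "", 3) recovers the satisfying assignment one bit at a time.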


16.2 OPTIMIZATION 


Theorem 16.1 allows us to find solutions for NP problems if P = NP, 
but it is not immediately clear that we can find the optimal solution. 
For example, suppose that P = NP, and you are given a graph G. Can 
you find the longest simple path in G in polynomial time? 


The answer is Yes. The idea is simple: if P = NP then we can find 
out in polynomial time if an n-vertex graph G contains a simple path 


of length n, and moreover, by Theorem 16.1, if G does contain such a 
path, then we can find it. (Can you see why?) If G does not contain a 
simple path of length n, then we will check if it contains a simple path 
of length n — 1, and continue in this way to find the largest k such that 
G contains a simple path of length k. 

The above reasoning was not specifically tailored to finding paths 
in graphs. In fact, it can be vastly generalized to proving the following 
result: 


Theorem 16.3 — Optimization from P = NP. Suppose that P = NP. Then
for every polynomial-time computable function f : {0,1}* → ℕ
(identifying f(x) with natural numbers via the binary representation)
there is a polynomial-time algorithm OPT such that on input
x ∈ {0,1}*,

OPT(x, 1^m) = max_{y∈{0,1}^m} f(x, y) .

Moreover under the same assumption, there is a polynomial-time
algorithm FINDOPT such that for every x ∈ {0,1}*, FINDOPT(x, 1^m)
outputs y* ∈ {0,1}^m such that f(x, y*) = max_{y∈{0,1}^m} f(x, y).



Proof Idea: 

The proof follows by generalizing our ideas from the longest path
example above. Let f be as in the theorem statement. If P = NP then
for every string x ∈ {0,1}* and number k, we can test in
poly(|x|, m) time whether there exists y such that f(x, y) ≥ k, or in
other words test whether max_{y∈{0,1}^m} f(x, y) ≥ k. If f(x, y) is an
integer between 0 and poly(|x| + |y|) (as is the case in the example of
longest path) then we can just try out all possibilities for k to find the
maximum number k for which max_y f(x, y) ≥ k. Otherwise, we can
use binary search to home in on the right value. Once we do so, we
can use search-to-decision to actually find the string y* that achieves
the maximum.

* 


Proof of Theorem 16.3. For every f as in the theorem statement, we can
define the Boolean function F : {0,1}* → {0,1} as follows.

F(x, 1^m, k) = 1 if ∃_{y∈{0,1}^m} f(x, y) ≥ k, and F(x, 1^m, k) = 0 otherwise.

Since f is computable in polynomial time, F is in NP, and so under
our assumption that P = NP, F itself can be computed in polynomial
time. Now, for every x and m, we can compute the largest k such that
F(x, 1^m, k) = 1 by a binary search. Specifically, we will do this as
follows:


1. We maintain two numbers a, b such that we are guaranteed that
a ≤ max_{y∈{0,1}^m} f(x, y) < b.


2. Initially we set a = 0 and b = 2^{T(n)} where T(n) is the running time
of f. (A function with T(n) running time can’t output more than
T(n) bits and so can’t output a number as large as 2^{T(n)}.)


3. At each point in time, we compute the midpoint c = ⌊(a + b)/2⌋
and let v = F(x, 1^m, c).


a. If v = 1 then we set a = c and leave b as it is.


b. If v = 0 then we set b = c and leave a as it is.

4. We then go back to step 3, until b ≤ a + 1.


Since |b − a| shrinks by a factor of 2 in every iteration, within
log_2 2^{T(n)} = T(n) steps, we will get to the point at which b ≤ a + 1,
and then we can simply output a. Once we find the maximum value of k such that
F(x, 1^m, k) = 1, we can use the search to decision reduction of
Theorem 16.1 to obtain the actual value y* ∈ {0,1}^m such that f(x, y*) = k.
∎
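
The binary search in this proof can be sketched as follows. The threshold oracle here is a brute-force, exponential-time stand-in for the polynomial-time algorithm for F that would exist if P = NP, and the names make_threshold_oracle, opt and the parameter upper (an assumed strict upper bound on f, playing the role of 2^{T(n)}) are illustrative.

```python
from itertools import product

def make_threshold_oracle(f, m):
    """Brute-force stand-in for F(x, 1^m, k): is there y in {0,1}^m with f(x, y) >= k?"""
    def F(x, k):
        return any(f(x, "".join(y)) >= k for y in product("01", repeat=m))
    return F

def opt(f, x, m, upper):
    """Return max over y in {0,1}^m of f(x, y), assuming 0 <= f(x, y) < upper."""
    F = make_threshold_oracle(f, m)
    a, b = 0, upper              # invariant: a <= max_y f(x, y) < b
    while b > a + 1:
        c = (a + b) // 2         # midpoint
        if F(x, c):
            a = c                # some y achieves value at least c
        else:
            b = c                # every y achieves value below c
    return a
```

The loop makes about log2(upper) oracle calls, matching the T(n) iterations counted in the proof.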


Example 16.4 — Integer programming. One application for
Theorem 16.3 is in solving optimization problems. For example, the task
of linear programming is to find y ∈ ℝ^n that maximizes some linear
objective ∑_{i=0}^{n−1} c_i y_i subject to the constraint that y satisfies linear
inequalities of the form ∑_{i=0}^{n−1} a_i y_i ≤ c. As we discussed in
Section 12.1.3, there is a known polynomial-time algorithm for linear
programming. However, if we want to place additional constraints
on y, such as requiring the coordinates of y to be integer or 0/1
valued then the best-known algorithms run in exponential time in
the worst case. However, if P = NP then Theorem 16.3 tells us
that we would be able to solve all problems of this form in
polynomial time. For every string x that describes a set of constraints
and objective, we will define a function f such that if y satisfies
the constraints of x then f(x, y) is the value of the objective, and
otherwise we set f(x, y) = −M where M is some large number. We
can then use Theorem 16.3 to compute the y that maximizes f(x, y)
and that will give us the assignment for the variables that satisfies
our constraints and maximizes the objective. (If the computation
results in y such that f(x, y) = −M then we can double M and try
again; if the true maximum objective is achieved by some string
y*, then eventually M will be large enough so that −M would be
smaller than the objective achieved by y*, and hence when we run
the procedure of Theorem 16.3 we would get a value larger than −M.)


16.2.1 Example: Supervised learning 

One classical optimization task is supervised learning. In supervised
learning we are given a list of examples x_0, x_1, ..., x_{m−1} (where we
can think of each x_i as a string in {0,1}^n for some n) and the
labels for them y_0, ..., y_{m−1} (which we will think of simply as bits, i.e.,
y_i ∈ {0,1}). For example, we can think of the x_i’s as images of
either dogs or cats, for which y_i = 1 in the former case and y_i = 0
in the latter case. Our goal is to come up with a hypothesis or
predictor h : {0,1}^n → {0,1} such that if we are given a new example x
that has an (unknown to us) label y, then with high probability h
will predict the label. That is, with high probability it will hold that
h(x) = y. The idea in supervised learning is to use the Occam’s
Razor principle: the simplest hypothesis that explains the data is likely
to be correct. There are several ways to model this, but one popular
approach is to pick some fairly simple function H : {0,1}^{k+n} → {0,1}.
We think of the first k inputs as the parameters and the last n inputs
as the example data. (For example, we can think of the first k inputs
of H as specifying the weights and connections for some neural
network that will then be applied on the latter n inputs.) We can then
phrase the supervised learning problem as finding, given a set of
labeled examples S = {(x_0, y_0), ..., (x_{m−1}, y_{m−1})}, the set of parameters
θ_0, ..., θ_{k−1} ∈ {0,1} that minimizes the number of errors made by the
predictor x ↦ H(θ, x).³

In other words, we can define for every set S as above the function
F_S : {0,1}^k → [m] such that F_S(θ) = ∑_{(x,y)∈S} |H(θ, x) − y|. Now,
finding the value θ that minimizes F_S(θ) is equivalent to solving the
supervised learning problem with respect to H. For every polynomial-time
computable H : {0,1}^{k+n} → {0,1}, the task of minimizing
F_S(θ) can be “massaged” to fit the form of Theorem 16.3 and hence if
P = NP, then we can solve the supervised learning problem in great
generality. In fact, this observation extends to essentially any
learning model, and allows for finding the optimal predictors given the
minimum number of examples. (This is in contrast to many current
learning algorithms, which often rely on having access to an extremely
large number of examples, far beyond the minimum needed, and
in particular far beyond the number of examples humans use for the
same tasks.)


³ This is often known as Empirical Risk Minimization.
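
To make the objective F_S concrete, here is a minimal sketch that computes the number of errors of a parameter string θ and minimizes it by exhaustive search over {0,1}^k. The exhaustive search takes 2^k steps and is only a stand-in for the polynomial-time optimizer that Theorem 16.3 would provide under P = NP; the function names and the toy model select_coordinate are hypothetical.

```python
from itertools import product

def empirical_errors(H, theta, S):
    """F_S(theta): number of labeled examples (x, y) in S on which H(theta, x) != y."""
    return sum(abs(H(theta, x) - y) for x, y in S)

def erm_brute_force(H, k, S):
    """Return a parameter string theta in {0,1}^k minimizing empirical_errors."""
    candidates = ("".join(bits) for bits in product("01", repeat=k))
    return min(candidates, key=lambda theta: empirical_errors(H, theta, S))

def select_coordinate(theta, x):
    """Toy model: theta encodes (in binary) which coordinate of x to output."""
    return int(x[int(theta, 2) % len(x)])
```

With the sample S = [("010", 1), ("110", 1), ("001", 0), ("100", 0)], whose labels equal the middle bit of each example, erm_brute_force(select_coordinate, 2, S) selects the parameters pointing at coordinate 1.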


16.2.2 Example: Breaking cryptosystems 

We will discuss cryptography later in this course, but it turns out that
if P = NP then almost every cryptosystem can be efficiently
broken. One approach is to treat finding an encryption key as an
instance of a supervised learning problem. If there is an encryption
scheme that maps a “plaintext” message p and a key θ to a
“ciphertext” c, then given examples of ciphertext/plaintext pairs of the
form (c_0, p_0), ..., (c_{m−1}, p_{m−1}), our goal is to find the key θ such that
E(θ, p_i) = c_i where E is the encryption algorithm. While you might
think getting such “labeled examples” is unrealistic, it turns out (as
many amateur home-brew crypto designers learn the hard way) that
this is actually quite common in real-life scenarios, and that it is also
possible to relax the assumption to having more minimal prior
information about the plaintext (e.g., that it is English text). We defer a
more formal treatment to Chapter 21.


16.3 FINDING MATHEMATICAL PROOFS 


In the context of Gödel’s Theorem, we discussed the notion of a proof
system (see Section 11.1). Generally speaking, a proof system can be
thought of as an algorithm V : {0,1}* → {0,1} (known as the verifier)
such that given a statement x ∈ {0,1}* and a candidate proof w ∈ {0,1}*,
V(x, w) = 1 if and only if w encodes a valid proof for the statement x.
Any type of proof system that is used in mathematics for geometry,
number theory, analysis, etc., is an instance of this form. In fact,
standard mathematical proof systems have an even simpler form where
the proof w encodes a sequence of lines w^0, ..., w^ℓ (each of which is
itself a binary string) such that each line w^i is either an axiom or
follows from some prior lines through an application of some inference
rule. For example, Peano’s axioms encode a set of axioms and rules
for the natural numbers, and one can use them to formalize proofs
in number theory. Also, there are some even stronger axiomatic
systems, the most popular one being Zermelo-Fraenkel with the Axiom
of Choice or ZFC for short. Thus, although mathematicians typically
write their papers in natural language, proofs of number theorists
can typically be translated to ZFC or similar systems, and so in
particular the existence of an n-page proof for a statement x implies that
there exists a string w of length poly(n) (in fact often O(n) or O(n^2))
that encodes the proof in such a system. Moreover, because verifying
a proof simply involves going over each line and checking that it
does indeed follow from the prior lines, it is fairly easy to do that in
O(|w|) or O(|w|^2) time (where as usual |w| denotes the length of the proof
w). This means that for every reasonable proof system V, the following
function SHORTPROOF_V : {0,1}* → {0,1} is in NP, where
for every input of the form x1^m, SHORTPROOF_V(x, 1^m) = 1 if and
only if there exists w ∈ {0,1}* with |w| ≤ m s.t. V(xw) = 1. That
is, SHORTPROOF_V(x, 1^m) = 1 if there is a proof (in the system V)
of length at most m bits that x is true. Thus, if P = NP, then despite
Gödel’s Incompleteness Theorems, we can still automate mathematics
in the sense of finding proofs that are not too long for every statement
that has one. (Frankly speaking, if the shortest proof for some
statement requires a terabyte, then human mathematicians won’t ever find
this proof either.) For this reason, Gödel himself felt that the question
of whether SHORTPROOF_V has a polynomial time algorithm is of
great interest. As Gödel wrote in a letter to John von Neumann in 1956
(before the concept of NP or even “polynomial time” was formally
defined):

One can obviously easily construct a Turing machine, which for every
formula F in first order predicate logic and every natural number n,
allows one to decide if there is a proof of F of length n (length = number
of symbols). Let ψ(F, n) be the number of steps the machine requires
for this and let φ(n) = max_F ψ(F, n). The question is how fast φ(n)
grows for an optimal machine. One can show that φ(n) ≥ k · n [for some
constant k > 0]. If there really were a machine with φ(n) ∼ k · n (or
even ∼ k · n^2), this would have consequences of the greatest importance.
Namely, it would obviously mean that in spite of the undecidability
of the Entscheidungsproblem,⁴ the mental work of a mathematician
concerning Yes-or-No questions could be completely replaced by a
machine. After all, one would simply have to choose the natural number
n so large that when the machine does not deliver a result, it makes no
sense to think more about the problem.


For many reasonable proof systems (including the one that Gödel
referred to), SHORTPROOF_V is in fact NP-complete, and so Gödel can
be thought of as the first person to formulate the P vs NP question.
Unfortunately, the letter was only discovered in 1988.


16.4 QUANTIFIER ELIMINATION (ADVANCED) 


If P = NP then we can solve all NP search and optimization problems in 
polynomial time. But can we do more? It turns out that the answer is 
that Yes we can! 

An NP decision problem can be thought of as the task of deciding,
given some string x ∈ {0,1}*, the truth of a statement of the form


∃_{y∈{0,1}^{p(|x|)}} V(xy) = 1


for some polynomial-time algorithm V and polynomial p : ℕ → ℕ.
That is, we are trying to determine, given some string x, whether
there exists a string y such that x and y satisfy some polynomial-time
checkable condition V. For example, in the independent set problem,
the string x represents a graph G and a number k, the string y
represents some subset S of G’s vertices, and the condition that we check is
whether |S| ≥ k and there is no edge {u, v} in G such that both u ∈ S
and v ∈ S.

We can consider more general statements such as checking, given a
string x ∈ {0,1}*, the truth of a statement of the form


∃_{y∈{0,1}^{p_0(|x|)}} ∀_{z∈{0,1}^{p_1(|x|)}} V(xyz) = 1 , (16.1)


which in words corresponds to checking, given some string x, whether
there exists a string y such that for every string z, the triple (x, y, z)
satisfies some polynomial-time checkable condition. We can also consider
more levels of quantifiers such as checking the truth of the statement


∃_{y∈{0,1}^{p_0(|x|)}} ∀_{z∈{0,1}^{p_1(|x|)}} ∃_{w∈{0,1}^{p_2(|x|)}} V(xyzw) = 1 (16.2)


and so on and so forth.
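
To make the meaning of such alternating statements concrete, the following sketch evaluates them by exhaustive search. It takes exponential time and is not the polynomial-time procedure discussed in this section; it also assumes, for simplicity, that every quantified string has the same length m ≥ 1 and that the verifier takes x and the concatenated quantified strings as two arguments. The names strings and holds are illustrative.

```python
from itertools import product

def strings(m):
    """All binary strings of length m."""
    return ("".join(bits) for bits in product("01", repeat=m))

def holds(V, x, m, levels, prefix=""):
    """Evaluate: exists y0, for all y1, exists y2, ... : V(x, y0 y1 ...) is true.

    `levels` is the number of quantifiers still to be processed; the
    quantifiers alternate, starting with an existential one.
    """
    if levels == 0:
        return bool(V(x, prefix))
    depth = len(prefix) // m          # quantifiers already fixed
    results = (holds(V, x, m, levels - 1, prefix + y) for y in strings(m))
    return any(results) if depth % 2 == 0 else all(results)
```

With m = 1, the statement "there exists y such that for every z, y = z" evaluates to false, while "there exists y such that for every z there exists w with z = w" evaluates to true.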

For example, given an n-input NAND-CIRC program P, we might
want to find the smallest NAND-CIRC program P′ that computes the
same function as P. The question of whether there is such a P′ that
can be described by a string of at most s bits can be phrased as


∃_{P′∈{0,1}^s} ∀_{x∈{0,1}^n} P(x) = P′(x) (16.3)


which has the form (16.1).⁵ Another example of a statement involving
a levels of quantifiers would be to check, given a chess position x,
whether there is a strategy that guarantees that White wins within a
steps. For example, if a = 3 we would want to check if given the board
position x, there exists a move y for White such that for every move z for
Black there exists a move w for White that ends in a checkmate.

⁴ The undecidability of Entscheidungsproblem refers
to the uncomputability of the function that maps a
statement in first order logic to 1 if and only if that
statement has a proof.

It turns out that if P = NP then we can solve these kinds of prob- 
lems as well: 


Theorem 16.6 — Polynomial hierarchy collapse. If P = NP then for every 
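$a \in \mathbb{N}$, polynomial $p : \mathbb{N} \to \mathbb{N}$ and polynomial-time algorithm
V, there is a polynomial-time algorithm $\mathrm{SOLVE}_{V,a}$ that on input
$x \in \{0,1\}^n$ returns 1 if and only if

$$\exists_{y_0 \in \{0,1\}^m} \forall_{y_1 \in \{0,1\}^m} \cdots \mathcal{Q}_{y_{a-1} \in \{0,1\}^m} \; V(x y_0 y_1 \cdots y_{a-1}) = 1 \tag{16.4}$$

where $m = p(n)$ and $\mathcal{Q}$ is either $\exists$ or $\forall$ depending on whether a is
odd or even, respectively.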


Proof Idea: 

To understand the idea behind the proof, consider the special case
where we want to decide, given $x \in \{0,1\}^n$, whether for every
$y \in \{0,1\}^n$ there exists $z \in \{0,1\}^n$ such that $V(xyz) = 1$. Consider the
function F such that $F(xy) = 1$ if there exists $z \in \{0,1\}^n$ such that
$V(xyz) = 1$. Since V runs in polynomial time, F is in NP and hence if
P = NP, then there is a polynomial-time algorithm V’ that on input x, y
outputs 1 if and only if there exists $z \in \{0,1\}^n$ such that $V(xyz) = 1$. Now we
can see that the original statement we consider is true if and only if for
every $y \in \{0,1\}^n$, $V'(xy) = 1$, which means it is false if and only if
the following condition (*) holds: there exists some $y \in \{0,1\}^n$ such
that $V'(xy) = 0$. But for every $x \in \{0,1\}^n$, the question of whether
the condition (*) holds is itself in NP (as we assumed V’ can be computed
in polynomial time) and hence under the assumption that P = NP
we can determine in polynomial time whether the condition (*), and
hence our original statement, is true.
* 


Proof of Theorem 16.6. We prove the theorem by induction. We assume
that there is a polynomial-time algorithm $\mathrm{SOLVE}_{V,a-1}$ that can solve
the problem (16.4) for $a - 1$ and use that to solve the problem for a.
For $a = 1$, $\mathrm{SOLVE}_{V,a-1}(x) = 1$ iff $V(x) = 1$, which is a polynomial-time
computation since V runs in polynomial time. For every $x, y_0$, define
the statement $\varphi_{x,y_0}$ to be the following:

$$\varphi_{x,y_0} = \forall_{y_1 \in \{0,1\}^m} \exists_{y_2 \in \{0,1\}^m} \cdots \mathcal{Q}_{y_{a-1} \in \{0,1\}^m} \; V(x y_0 y_1 \cdots y_{a-1}) = 1$$

By the definition of $\mathrm{SOLVE}_{V,a}$, for every $x \in \{0,1\}^n$, our goal is
that $\mathrm{SOLVE}_{V,a}(x) = 1$ if and only if there exists $y_0 \in \{0,1\}^m$ such that
$\varphi_{x,y_0}$ is true.

6 For the ease of notation, we assume that all the
strings we quantify over have the same length m =
p(n), but using simple padding one can show that
this captures the general case of strings of different
polynomial lengths.

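The negation of $\varphi_{x,y_0}$ is the statement

$$\overline{\varphi}_{x,y_0} = \exists_{y_1 \in \{0,1\}^m} \forall_{y_2 \in \{0,1\}^m} \cdots \overline{\mathcal{Q}}_{y_{a-1} \in \{0,1\}^m} \; V(x y_0 y_1 \cdots y_{a-1}) = 0$$

where $\overline{\mathcal{Q}}$ is $\exists$ if $\mathcal{Q}$ was $\forall$ and $\overline{\mathcal{Q}}$ is $\forall$ otherwise. (Please stop and verify
that you understand why this is true; this is a generalization of the fact
that if $\Psi$ is some logical condition then the negation of $\exists_y \forall_z \Psi(y, z)$ is
$\forall_y \exists_z \neg\Psi(y, z)$.)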


The crucial observation is that $\overline{\varphi}_{x,y_0}$ is exactly a statement of the
form we consider with $a - 1$ quantifiers instead of a, and hence by
our inductive hypothesis there is some polynomial time algorithm
$\overline{S}$ that on input $x y_0$ outputs 1 if and only if $\overline{\varphi}_{x,y_0}$ is true. If we let S
be the algorithm that on input $x, y_0$ outputs $1 - \overline{S}(x y_0)$ then we see
that S outputs 1 if and only if $\varphi_{x,y_0}$ is true. Hence we can rephrase the
original statement (16.4) as follows:

$$\exists_{y_0 \in \{0,1\}^m} \; S(x y_0) = 1 \tag{16.5}$$


but since S is a polynomial-time algorithm, Eq. (16.5) is clearly a
statement in NP and hence under our assumption that P = NP there is
a polynomial time algorithm that on input $x \in \{0,1\}^n$ will determine
whether (16.5) is true and so also whether the original statement (16.4) is true.

■
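
The inductive structure of the proof can be mirrored in a short Python sketch. This is only an illustration under stated assumptions: `np_oracle` is a hypothetical stand-in for the polynomial-time algorithm that P = NP would supply, and here it is implemented by exhaustive search, so the code as written takes exponential time. The function name `solve` and the tuple encoding of strings are likewise illustrative.

```python
from itertools import product

def np_oracle(predicate, m):
    """Stand-in for the algorithm that P = NP would provide: decides whether
    some y in {0,1}^m makes predicate(y) true. Here: exhaustive search."""
    return any(predicate(y) for y in product((0, 1), repeat=m))

def solve(V, a, m, prefix=()):
    """Decide: exists y0, for all y1, ... such that V(*prefix, y0, ..., y_{a-1}) = 1."""
    if a == 0:
        return V(*prefix) == 1
    negated_V = lambda *args: 1 - V(*args)
    # S(y0) holds iff the statement with the remaining a-1 quantifiers is true,
    # that is, iff its negation (which again starts with "exists") is false.
    S = lambda y0: not solve(negated_V, a - 1, m, prefix + (y0,))
    # The rephrased statement "exists y0 with S(y0) = 1" has a single quantifier.
    return np_oracle(S, m)

def V_and(x, y, z):
    """1 iff the bitwise AND of y and z is all zeros."""
    return int(all((yi & zi) == 0 for yi, zi in zip(y, z)))

def V_eq(x, y, z):
    """1 iff y and z are the same string."""
    return int(y == z)

print(solve(V_and, 2, 2, prefix=((),)))  # exists y, for all z: y AND z = 0
print(solve(V_eq, 2, 2, prefix=((),)))   # exists y, for all z: y = z
```

Each level of the recursion makes one oracle call on a predicate that itself hides a recursive call, which is one way to see why the exponent of the polynomial grows with the number of quantifiers.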


The algorithm of Theorem 16.6 can also solve the search problem:
find the value $y_0$ that certifies the truth of (16.4). We note
that while this algorithm is in polynomial time, the exponent of this
polynomial blows up quite fast. If the original NANDSAT algorithm
required $\Omega(n^2)$ time, solving a levels of quantifiers would require time
$\Omega(n^{2^a})$.⁷


16.4.1 Application: self improving algorithm for 3SAT 

Suppose that we found a polynomial-time algorithm A for 3SAT that
is “good but not great”. For example, maybe our algorithm runs in
time $c n^2$ for some not too small constant c. However, it’s possible
that the best possible SAT algorithm is actually much more efficient
than that. Perhaps, as we guessed before, there is a circuit $C^*$ of at
most $10^6 n$ gates that computes 3SAT on n variables, and we simply
haven't discovered it yet. We can use Theorem 16.6 to “bootstrap” our
original “good but not great” 3SAT algorithm to discover the optimal
one. The idea is that we can phrase the question of whether there
exists a size s circuit that computes 3SAT for all length n inputs as
follows: there exists a size $\leq s$ circuit C such that for every formula $\varphi$
described by a string of length at most n, if $C(\varphi) = 1$ then there exists
an assignment x to the variables of $\varphi$ that satisfies it. One can see that
this is a statement of the form (16.2) and hence if P = NP we can solve
it in polynomial time as well. We can therefore imagine investing huge
computational resources in running A one time to discover the circuit
$C^*$ and then using $C^*$ for all further computation.

7 We do not know whether such loss is inherent.
As far as we can tell, it’s possible that the quantified
boolean formula problem has a linear-time algorithm.
We will, however, see later in this course that it
satisfies a notion known as PSPACE-hardness that is
even stronger than NP-hardness.


16.5 APPROXIMATING COUNTING PROBLEMS AND POSTERIOR 
SAMPLING (ADVANCED, OPTIONAL) 


Given a Boolean circuit C, if P = NP then we can find an input x (if
one exists) such that C(x) = 1. But what if there is more than one x
like that? Clearly we can’t efficiently output all such x’s; there might
be exponentially many. But we can get an arbitrarily good multiplicative
approximation (i.e., a $1 \pm \epsilon$ factor for arbitrarily small $\epsilon > 0$) for the
number of such x’s, as well as output a (nearly) uniform member of
this set. The details are beyond the scope of this book, but this result is
formally stated in the following theorem (whose proof is omitted).


Theorem 16.7 — Approximate counting if P = NP. Let $V : \{0,1\}^* \to \{0,1\}$
be some polynomial-time algorithm, and suppose that P = NP.
Then there exists an algorithm $\mathrm{COUNT}_V$ that on input $x, 1^m, \epsilon$,
runs in time polynomial in $|x|, m, 1/\epsilon$ and outputs a number in
$[2^m + 1]$ satisfying

$$(1-\epsilon)\,\mathrm{COUNT}_V(x,m,\epsilon) \leq \left|\{ y \in \{0,1\}^m : V(xy) = 1 \}\right| \leq (1+\epsilon)\,\mathrm{COUNT}_V(x,m,\epsilon) \;.$$


In other words, the algorithm $\mathrm{COUNT}_V$ gives an approximation
up to a factor of $1 \pm \epsilon$ for the number of witnesses for x with respect
to the verifying algorithm V. Once again, to understand this theorem
it can be useful to see how it implies that if P = NP then there is a
polynomial-time algorithm that given a graph G and a number k,
can compute a number K that is within a $1 \pm 0.01$ factor of the
number of simple paths in G of length k. (That is, K is between 0.99 and
1.01 times the number of such paths.)
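
To see what the guarantee of Theorem 16.7 asks for, the following Python sketch counts simple paths exactly by exhaustive search and then checks whether a candidate estimate satisfies the two-sided inequality. It is not the algorithm of the theorem (whose proof is omitted above). The helper names, the adjacency-dictionary encoding, and the conventions that "length k" means k edges and that each direction of a path is counted separately are all assumptions made for this illustration.

```python
def count_simple_paths(adj, k):
    """Exactly count simple paths with k edges (exponential-time search)."""
    def extend(path):
        if len(path) == k + 1:
            return 1
        return sum(extend(path + [v]) for v in adj[path[-1]] if v not in path)
    return sum(extend([start]) for start in adj)

def within_factor(estimate, true_count, eps):
    """The guarantee of the theorem: (1-eps)*estimate <= true_count <= (1+eps)*estimate."""
    return (1 - eps) * estimate <= true_count <= (1 + eps) * estimate

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
exact = count_simple_paths(triangle, 2)
print(exact)
print(within_factor(exact, exact, 0.01))      # the exact count is always acceptable
print(within_factor(exact + 1, exact, 0.01))  # an estimate that is off by one
```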


Posterior sampling and probabilistic programming. The algorithm for counting
can also be extended to sampling from a given posterior distribution.
That is, if $C : \{0,1\}^n \to \{0,1\}^m$ is a Boolean circuit and
$y \in \{0,1\}^m$, then if P = NP we can sample from (a close approximation
of) the distribution of uniform $x \in \{0,1\}^n$ conditioned on
C(x) = y. This task is known as posterior sampling and is crucial for
Bayesian data analysis. These days it is known how to achieve posterior
sampling only for circuits C of very special form, and even in
these cases more often than not we do not have guarantees on the quality
of the sampling algorithm. The field of making inferences by sampling
from posterior distributions specified by circuits or programs is known
as probabilistic programming.


16.6 WHAT DOES ALL OF THIS IMPLY? 


So, what will happen if we have a $10^6 n$ algorithm for 3SAT? We have
mentioned that NP-hard problems arise in many contexts, and indeed 
scientists, engineers, programmers and others routinely encounter 
such problems in their daily work. A better 3SAT algorithm will prob- 
ably make their lives easier, but that is the wrong place to look for 
the most foundational consequences. Indeed, while the invention of 
electronic computers did of course make it easier to do calculations 
that people were already doing with mechanical devices and pen and 
paper, the main applications computers are used for today were not 
even imagined before their invention. 

An exponentially faster algorithm for all NP problems would be 
no less radical an improvement (and indeed, in some sense would 
be more) than the computer itself, and it is as hard for us to imagine 
what it would imply as it was for Babbage to envision today’s world. 
For starters, such an algorithm would completely change the way we 
program computers. Since we could automatically find the “best” 
(in any measure we chose) program that achieves a certain task, we 
would not need to define how to achieve a task, but only specify tests 
as to what would be a good solution, and could also ensure that a 
program satisfies an exponential number of tests without actually 
running them. 

The possibility that P = NP is often described as “automating 
creativity”. There is something to that analogy, as we often think of 
a creative solution as one that is hard to discover but that, once the 
“spark” hits, is easy to verify. But there is also an element of hubris 
to that statement, implying that the most impressive consequence of 
such an algorithmic breakthrough will be that computers would suc- 
ceed in doing something that humans already do today. Nevertheless, 
artificial intelligence, like many other fields, will clearly be greatly 
impacted by an efficient 3SAT algorithm. For example, it is clearly 
much easier to find a better Chess-playing algorithm when, given any 
algorithm P, you can find the smallest algorithm P’ that plays Chess 
better than P. Moreover, as we mentioned above, much of machine
learning (and statistical reasoning in general) is about finding “simple”
concepts that explain the observed data, and if NP = P, we could
search for such concepts automatically for any notion of “simplicity” 
we see fit. In fact, we could even “skip the middle man” and do an 
automatic search for the learning algorithm with smallest general- 
ization error. Ultimately the field of Artificial Intelligence is about 
trying to “shortcut” billions of years of evolution to obtain artificial 
programs that match (or beat) the performance of natural ones, and a 
fast algorithm for NP would provide the ultimate shortcut.⁸

More generally, a faster algorithm for NP problems would be immensely
useful in any field where one is faced with computational or
quantitative problems, which is basically all fields of science, math,
and engineering. This will not only help with concrete problems such
as designing a better bridge, or finding a better drug, but also with 
addressing basic mysteries such as trying to find scientific theories or 
“laws of nature”. In a fascinating talk, physicist Nima Arkani-Hamed 
discusses the effort of finding scientific theories in much the same lan- 
guage as one would describe solving an NP problem, for which the 
solution is easy to verify or seems “inevitable”, once found, but that 
requires searching through a huge landscape of possibilities to reach, 
and that often can get “stuck” at local optima: 


“the laws of nature have this amazing feeling of inevitability... which is associ- 
ated with local perfection.” 


“The classical picture of the world is the top of a local mountain in the space of 
ideas. And you go up to the top and it looks amazing up there and absolutely 
incredible. And you learn that there is a taller mountain out there. Find it, 
Mount Quantum... they're not smoothly connected ... you've got to make a 
jump to go from classical to quantum ... This also tells you why we have such 
major challenges in trying to extend our understanding of physics. We don’t 
have these knobs, and little wheels, and twiddles that we can turn. We have to 
learn how to make these jumps. And it is a tall order. And that’s why things are 
difficult.” 


Finding an efficient algorithm for NP amounts to always being able 
to search through an exponential space and find not just the “local” 
mountain, but the tallest peak. 

But perhaps more than any computational speedups, a fast algorithm
for NP problems would bring about a new type of understanding.
In many of the areas where NP-completeness arises, it is not as much
a barrier for solving computational problems as it is a barrier for obtaining
“closed-form formulas” or other types of more constructive
descriptions of the behavior of natural, biological, social and other systems.
A better algorithm for NP, even if it is “merely” $2^{\sqrt{n}}$-time, seems
to require obtaining a new way to understand these types of systems,
whether it is characterizing Nash equilibria, spin-glass configurations,
entangled quantum states, or any of the other questions where NP is
currently a barrier for analytical understanding. Such new insights
would be very fruitful regardless of their computational utility.

8 One interesting theory is that P = NP and evolution
has already discovered this algorithm, which we are
already using without realizing it. At the moment,
there seems to be very little evidence for such a scenario.
In fact, we have some partial results in the
other direction showing that, regardless of whether
P = NP, many types of “local search” or “evolutionary”
algorithms require exponential time to solve
3SAT and other NP-hard problems.


16.7 CAN P ≠ NP BE NEITHER TRUE NOR FALSE?


The Continuum Hypothesis is a conjecture made by Georg Cantor in
1878, positing the non-existence of a certain type of infinite cardinality.
(One way to phrase it is that for every infinite subset S of the real
numbers $\mathbb{R}$, either there is a one-to-one and onto function $f : S \to \mathbb{R}$
or there is a one-to-one and onto function $f : S \to \mathbb{N}$.) This was
considered one of the most important open problems in set theory,
and settling its truth or falseness was the first problem put forward by
Hilbert in the 1900 address we mentioned before. However, using the
theories developed by Gödel and Turing, in 1963 Paul Cohen proved
that both the Continuum Hypothesis and its negation are consistent
with the standard axioms of set theory (i.e., the Zermelo-Fraenkel
axioms + the Axiom of Choice, or “ZFC” for short). Formally, what
he proved is that if ZFC is consistent, then so is ZFC when we assume
either the Continuum Hypothesis or its negation.

Today, many (though not all) mathematicians interpret this result 
as saying that the Continuum Hypothesis is neither true nor false, but 
rather is an axiomatic choice that we are free to make one way or the 
other. Could the same hold for P ≠ NP?

In short, the answer is No. For example, suppose that we are trying
to decide between the “3SAT is easy” conjecture (there is a $10^6 n$
time algorithm for 3SAT) and the “3SAT is hard” conjecture (for every
n, any NAND-CIRC program that solves n variable 3SAT takes
$2^{10^{-6} n}$ lines). Then, since for $n = 10^8$, $2^{10^{-6} n} > 10^6 n$, this boils down
to the finite question of deciding whether or not there is a $10^{12}$-line
NAND-CIRC program deciding 3SAT on formulas with $10^8$ variables.
If there is such a program then there is a finite proof of its existence,
namely the approximately 1TB file describing the program, and for
which the verification is the (finite in principle though infeasible in
practice) process of checking that it succeeds on all inputs.⁹ If there
isn’t such a program, then there is also a finite proof of that, though
any such proof would take longer since we would need to enumerate
over all programs as well. Ultimately, since it boils down to a finite
statement about bits and numbers, either the statement or its negation
must follow from the standard axioms of arithmetic in a finite number
of arithmetic steps. Thus, we cannot justify our ignorance in distinguishing
between the “3SAT easy” and “3SAT hard” cases by claiming
that this might be an inherently ill-defined question. Similar reasoning
(with different numbers) applies to other variants of the P vs NP
question. We note that in the case that 3SAT is hard, it may well be
that there is no short proof of this fact using the standard axioms, and
this is a question that people have been studying in various restricted
forms of proof systems.

9 This inefficiency is not necessarily inherent. Later
in this course we may discuss results in program
checking, interactive proofs, and average-case complexity,
that can be used for efficient verification of
proofs of related statements. In contrast, the inefficiency
of verifying failure of all programs could well
be inherent.


16.8 IS P = NP “IN PRACTICE”? 


The fact that a problem is NP-hard means that we believe there is no
efficient algorithm that solves it in the worst case. It does not, however,
mean that every single instance of the problem is hard. For example,
if all the clauses in a 3SAT instance $\varphi$ contain the same variable
$x_i$ (possibly in negated form), then by guessing a value for $x_i$ we can
reduce $\varphi$ to a 2SAT instance which can then be efficiently solved. Generalizations
of this simple idea are used in “SAT solvers”, which are
algorithms that have solved certain specific interesting SAT formulas
with thousands of variables, despite the fact that we believe SAT to
be exponentially hard in the worst case. Similarly, a lot of problems
arising in economics and machine learning are NP-hard.¹⁰ And yet
vendors and customers manage to figure out market-clearing prices 
(as economists like to point out, there is milk on the shelves) and mice 
succeed in distinguishing cats from dogs. Hence people (and ma- 
chines) seem to regularly succeed in solving interesting instances of 
NP-hard problems, typically by using some combination of guessing 
while making local improvements. 
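
The 3SAT example at the start of this section can be sketched in Python as follows. The encoding of literals as signed integers (+i for a variable, -i for its negation) and all function names are assumptions made for this illustration. Note also that the 2SAT step below is a brute-force stand-in used only to keep the sketch short; a genuine polynomial-time 2SAT algorithm would be substituted in practice.

```python
from itertools import product

def restrict(clauses, var, value):
    """Simplify a CNF formula after fixing variable `var` to `value`."""
    satisfied_literal = var if value else -var
    reduced = []
    for clause in clauses:
        if satisfied_literal in clause:
            continue  # this clause is already satisfied, so drop it
        reduced.append(tuple(lit for lit in clause if abs(lit) != var))
    return reduced

def satisfies(clauses, assignment):
    """Check that every clause contains at least one true literal."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

def solve_2sat_brute_force(clauses, variables):
    """Stand-in for an efficient 2SAT solver (this version is exponential)."""
    for bits in product((False, True), repeat=len(variables)):
        assignment = dict(zip(variables, bits))
        if satisfies(clauses, assignment):
            return assignment
    return None

def solve_with_common_variable(clauses, var):
    """Solve a 3SAT instance in which every clause mentions `var`."""
    others = sorted({abs(lit) for clause in clauses for lit in clause} - {var})
    for value in (True, False):
        reduced = restrict(clauses, var, value)
        # Every clause contained `var`, so at most two literals remain in each.
        assert all(len(clause) <= 2 for clause in reduced)
        partial = solve_2sat_brute_force(reduced, others)
        if partial is not None:
            return {var: value, **partial}
    return None

formula = [(1, 2, 3), (-1, -2, 3), (1, -3, 4)]
solution = solve_with_common_variable(formula, 1)
print(solution, satisfies(formula, solution))
```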

It is also true that there are many interesting instances of NP-hard 
problems that we do not currently know how to solve. Across all ap- 
plication areas, whether it is scientific computing, optimization, con- 
trol or more, people often encounter hard instances of NP problems 
on which our current algorithms fail. In fact, as we will see, all of our 
digital security infrastructure relies on the fact that some concrete and 
easy-to-generate instances of, say, 3SAT (or, equivalently, any other
NP-hard problem) are exponentially hard to solve. 

Thus it would be wrong to say that NP is easy “in practice”, nor 
would it be correct to take NP-hardness as the “final word” on the 
complexity of a problem, particularly when we have more informa- 
tion about how any given instance is generated. Understanding both 
the “typical complexity” of NP problems, as well as the power and 
limitations of certain heuristics (such as various local-search based al- 
gorithms) is a very active area of research. We will see more on these 
topics later in this course. 


10 Actually, the computational difficulty of problems in 
economics such as finding optimal (or any) equilibria 
is quite subtle. Some variants of such problems are 
NP-hard, while others have a certain “intermediate” 
complexity.

16.9 WHAT IF P ≠ NP?


So, P = NP would give us all kinds of fantastical outcomes. But we
strongly suspect that P ≠ NP, and moreover that there is no
much-better-than-brute-force algorithm for 3SAT. If indeed that is the case, is
it all bad news?

One might think that impossibility results, telling you that you
cannot do something, are the kind of cloud that does not have a silver
lining. But in fact, as we already alluded to before, they do have one. A hard
(in a sufficiently strong sense) problem in NP can be used to create a 
code that cannot be broken, a task that for thousands of years has been 
the dream of not just spies but of many scientists and mathematicians 
over the generations. But the complexity viewpoint turned out to 
yield much more than simple codes, achieving tasks that people had 
previously not even dared to dream of. These include the notion of 
public key cryptography, allowing two people to communicate securely 
without ever having exchanged a secret key; electronic cash, allowing 
private and secure transaction without a central authority; and secure 
multiparty computation, enabling parties to compute a joint function on 
private inputs without revealing any extra information about it. Also, 
as we will see, computational hardness can be used to replace the role 
of randomness in many settings. 

Furthermore, while it is often convenient to pretend that computa- 
tional problems are simply handed to us, and that our job as computer 
scientists is to find the most efficient algorithm for them, this is not 
how things work in most computing applications. Typically even for- 
mulating the problem to solve is a highly non-trivial task. When we 
discover that the problem we want to solve is NP-hard, this might be a 
useful sign that we used the wrong formulation for it. 

Beyond all these, the quest to understand computational hardness 
— including the discoveries of lower bounds for restricted compu- 
tational models, as well as new types of reductions (such as those 
arising from “probabilistically checkable proofs”) — has already had 
surprising positive applications to problems in algorithm design, as 
well as in coding for both communication and storage. This is not 
surprising since, as we mentioned before, from group theory to the 
theory of relativity, the pursuit of impossibility results has often been 
one of the most fruitful enterprises of mankind. 


Chapter Recap

• The question of whether P = NP is one of the
most important and fascinating questions of com-

11 Talk more about coping with NP hardness. Main
two approaches are heuristics such as SAT solvers that
succeed on some instances, and proxy measures such
as mathematical relaxations that instead of solving
problem X (e.g., an integer program) solve problem
X’ (e.g., a linear program) that is related to that.
Maybe give compressed sensing as an example, and
least square minimization as a proxy for maximum
a posteriori probability.

16.10 EXERCISES


16.11 BIBLIOGRAPHICAL NOTES 


As mentioned before, Aaronson’s survey [Aar16] is a great exposition 
of the P vs NP problem. Another recommended survey by Aaronson 
is [Aar05] which discusses the question of whether NP complete 
problems could be computed by any physical means. 

The paper [BU11] discusses some results about problems in the 
polynomial hierarchy. 


17 
Space bounded computation 


PLAN: Example of space bounded algorithms, importance of pre- 
serving space. The classes L and PSPACE, space hierarchy theorem, 
PSPACE=NPSPACE, constant space = regular languages. 


17.1 EXERCISES 
17.2 BIBLIOGRAPHICAL NOTES

RANDOMIZED COMPU

18

Learning Objectives:

Review the basic notion of probability theory
that we will use.

Sample spaces, and in particular the space
$\{0,1\}^n$

Events, probabilities of unions and
intersections.

Random variables and their expectation,
variance, and standard deviation.

Independence and correlation for both events
and random variables.

variable will deviate from its expectation).


“God doesn't play dice with the universe”, Albert Einstein 


“Einstein was doubly wrong ... not only does God definitely play dice, but He 
sometimes confuses us by throwing them where they can’t be seen.”, Stephen 
Hawking 


“ ‘The probability of winning a battle has no place in our theory because it 
does not belong to any [random experiment]. Probability cannot be applied 
to this problem any more than the physical concept of work can be applied to 
the ‘work’ done by an actor reciting his part.”, Richard Von Mises, 1928 
(paraphrased) 


“I am unable to see why ‘objectivity’ requires us to interpret every probability
as a frequency in some random experiment; particularly when in most problems 
probabilities are frequencies only in an imaginary universe invented just for the 
purpose of allowing a frequency interpretation.”, E.T. Jaynes, 1976 


Before we show how to use randomness in algorithms, let us do a 
quick review of some basic notions in probability theory. This is not 
meant to replace a course on probability theory, and if you have not 
seen this material before, I highly recommend you look at additional 
resources to get up to speed. Fortunately, we will not need many of 
the advanced notions of probability theory, but, as we will see, even 
the so-called “simple” setting of tossing n coins can lead to very subtle 
and interesting issues.

2. An event, which is simply a subset of the sample space,
with the probability of the event happening being the 
fraction of outcomes that are in this subset. 


3. A random variable, which is a way to assign some number 
or statistic to an outcome of the sample space. 


4. The notion of conditioning, which corresponds to how the 
value of a random variable (or the probability of an event) 
changes if we restrict attention to outcomes for which the 
value of another variable is known (or for which some 
other event has happened). Random variables and events 
that have no impact on one another are called independent. 


5. Expectation, which is the average of a random variable, 
and concentration bounds which quantify the probability 
that a random variable can “stray too far” from its ex- 
pected value. 


These concepts are at once both basic and subtle. While 

we will not need many “fancy” topics covered in statistics 
courses, including special distributions (e.g., geometric,
Poisson, exponential, Gaussian, etc.), nor topics such as hy- 
pothesis testing or regression, this doesn’t mean that the 
probability we use is “trivial”. The human brain has not 
evolved to do probabilistic reasoning very well, and notions 
such as conditioning and independence can be quite subtle 
and confusing even in the basic setting of tossing a random 
coin. However, this is all the more reason that studying these 
notions in this basic setting is useful not just for following 
this book, but also as a strong foundation for “fancier topics”. 


18.1 RANDOM COINS 


The nature of randomness and probability is a topic of great philo- 
sophical, scientific and mathematical depth. Is there actual random- 
ness in the world, or does it proceed in a deterministic clockwork fash- 
ion from some initial conditions set at the beginning of time? Does 
probability refer to our uncertainty of beliefs, or to the frequency of 
occurrences in repeated experiments? How can we define probability 
over infinite sets? 

These are all important questions that have been studied and de- 
bated by scientists, mathematicians, statisticians and philosophers. 
Fortunately, we will not need to deal directly with these questions 
here. We will be mostly interested in the setting of tossing n random,
unbiased and independent coins. Below we define the basic probabilistic
objects of events and random variables when restricted to this
setting. These can be defined for much more general probabilistic ex- 
periments or sample spaces, and later on we will briefly discuss how 
this can be done. However, the n-coin case is sufficient for almost 
everything we'll need in this course. 

If instead of “heads” and “tails” we encode the sides of each coin
by “zero” and “one”, we can encode the result of tossing n coins as
a string in $\{0,1\}^n$. Each particular outcome $x \in \{0,1\}^n$ is obtained
with probability $2^{-n}$. For example, if we toss three coins, then we
obtain each of the 8 outcomes 000, 001, 010, 011, 100, 101, 110, 111
with probability $2^{-3} = 1/8$ (see also Fig. 18.1). We can describe the
experiment of tossing n coins as choosing a string x uniformly at
random from $\{0,1\}^n$, and hence we'll use the shorthand $x \sim \{0,1\}^n$
for x that is chosen according to this experiment.
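
As a small illustration of this sample space, the Python sketch below lists all outcomes of tossing n coins and draws one outcome uniformly at random. The function names `sample_space` and `toss` are hypothetical, chosen only for this example, and the standard library's pseudorandom generator stands in for truly random coins.

```python
import random
from fractions import Fraction
from itertools import product

def sample_space(n):
    """All 2^n outcomes of tossing n coins, each encoded as a tuple of bits."""
    return list(product((0, 1), repeat=n))

def toss(n, rng=random):
    """Draw x ~ {0,1}^n by tossing n independent unbiased (pseudorandom) coins."""
    return tuple(rng.randint(0, 1) for _ in range(n))

outcomes = sample_space(3)
prob_each = Fraction(1, len(outcomes))  # every outcome is equally likely
print(len(outcomes), prob_each)
print(toss(3))
```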

An event is simply a subset A of $\{0,1\}^n$. The probability of A, denoted
by $\Pr_{x \sim \{0,1\}^n}[A]$ (or $\Pr[A]$ for short, when the sample space is
understood from the context), is the probability that an x chosen uniformly
at random will be contained in A. Note that this is the same as
$|A|/2^n$ (where |A| as usual denotes the number of elements in the set
A). For example, the probability that x has an even number of ones
is $\Pr[A]$ where $A = \{ x : \sum_{i=0}^{n-1} x_i = 0 \mod 2 \}$. In the case $n = 3$,
$A = \{000, 011, 101, 110\}$, and hence $\Pr[A] = \frac{4}{8} = \frac{1}{2}$ (see Fig. 18.2). It
turns out this is true for every n:

Lemma 18.1 For every $n > 0$,

$$\Pr_{x \sim \{0,1\}^n}\left[ \sum_{i=0}^{n-1} x_i \text{ is even} \right] = 1/2$$


Proof of Lemma 18.1. We prove the lemma by induction on n. For the
case $n = 1$ it is clear since $x = 0$ is even and $x = 1$ is odd, and hence
the probability that $x \in \{0,1\}$ is even is 1/2. Let $n > 1$. We assume
by induction that the lemma is true for $n - 1$ and we will prove it
for n. We split the set $\{0,1\}^n$ into four disjoint sets $E_0, E_1, O_0, O_1$,
where for $b \in \{0,1\}$, $E_b$ is defined as the set of $x \in \{0,1\}^n$ such that
$x_0 \cdots x_{n-2}$ has an even number of ones and $x_{n-1} = b$, and similarly $O_b$ is
the set of $x \in \{0,1\}^n$ such that $x_0 \cdots x_{n-2}$ has an odd number of ones and
$x_{n-1} = b$. Since $E_0$ is obtained by simply extending an $(n-1)$-length string
with an even number of ones by the digit 0, the size of $E_0$ is simply the
number of such $(n-1)$-length strings, which by the induction hypothesis
is $2^{n-1}/2 = 2^{n-2}$. The same reasoning applies for $E_1$, $O_0$, and $O_1$.
Hence each one of the four sets $E_0, E_1, O_0, O_1$ is of size $2^{n-2}$. Since
$x \in \{0,1\}^n$ has an even number of ones if and only if $x \in E_0 \cup O_1$
(i.e., either the first $n - 1$ coordinates sum up to an even number and
the final coordinate is 0 or the first $n - 1$ coordinates sum up to an odd
number and the final coordinate is 1), we get that the probability that
x satisfies this property is

$$\frac{|E_0 \cup O_1|}{2^n} = \frac{2^{n-2} + 2^{n-2}}{2^n} = \frac{1}{2} \;,$$

using the fact that $E_0$ and $O_1$ are disjoint and hence $|E_0 \cup O_1| =
|E_0| + |O_1|$.

Figure 18.1: The probabilistic experiment of tossing
three coins corresponds to making 2 × 2 × 2 = 8
choices, each with equal probability. In this example,
the blue set corresponds to the event $A = \{x \in
\{0,1\}^3 \mid x_0 = 0\}$ where the first coin toss is equal
to 0, and the pink set corresponds to the event $B =
\{x \in \{0,1\}^3 \mid x_1 = 1\}$ where the second coin toss is
equal to 1 (with their intersection having a purplish
color). As we can see, each of these events contains 4
elements (out of 8 total) and so has probability 1/2.
The intersection of A and B contains two elements,
and so the probability that both of these events occur
is 2/8 = 1/4.

Figure 18.2: The event that if we toss three coins
$x_0, x_1, x_2 \in \{0,1\}$ then the sum of the $x_i$’s is even
has probability 1/2 since it corresponds to exactly 4
out of the 8 possible strings of length 3.
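
Lemma 18.1 can be sanity-checked for small n by exhaustive enumeration. The Python sketch below does this with exact fractions; the helper `prob` is a hypothetical name for "fraction of outcomes in the event", and the check is of course a finite experiment rather than a replacement for the inductive proof.

```python
from fractions import Fraction
from itertools import product

def prob(event, n):
    """Pr over x ~ {0,1}^n of event(x): the fraction of outcomes in the event."""
    favourable = sum(1 for x in product((0, 1), repeat=n) if event(x))
    return Fraction(favourable, 2 ** n)

def even_parity(x):
    """True iff x contains an even number of ones."""
    return sum(x) % 2 == 0

for n in range(1, 11):
    print(n, prob(even_parity, n))
```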


We can also use the intersection ($\cap$) and union ($\cup$) operators to
talk about the probability of both event A and event B happening, or
the probability of event A or event B happening. For example, the
probability p that x has an even number of ones and $x_0 = 1$ is the same
as $\Pr[A \cap B]$ where $A = \{ x \in \{0,1\}^n : \sum_{i=0}^{n-1} x_i = 0 \mod 2 \}$ and
$B = \{ x \in \{0,1\}^n : x_0 = 1 \}$. This probability is equal to 1/4 for
$n > 1$. (It is a great exercise for you to pause here and verify that you
understand why this is the case.)

Because intersection corresponds to considering the logical AND 
of the conditions that two events happen, while union corresponds 
to considering the logical OR, we will sometimes use the ^ and V 
operators instead of N and U, and so write this probability p = Pr[A N 
B] defined above also as 


s X r=0 mod 2 A zọ=1 


4 


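If $A \subseteq \{0,1\}^n$ is an event, then
$\overline{A} = \{0,1\}^n \setminus A$ corresponds to the event that $A$
does not happen. Since $|\overline{A}| = 2^n - |A|$, we get that

$$\Pr[\overline{A}] = \frac{|\overline{A}|}{2^n} = \frac{2^n - |A|}{2^n} = 1 - \frac{|A|}{2^n} = 1 - \Pr[A].$$

This makes sense: since $\overline{A}$ happens if and only if $A$ does
not happen, the probability of $\overline{A}$ should be one minus the
probability of $A$.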
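
The counting arguments above can be checked by brute force for small n. The following is a minimal sketch (the helper name `prob` and the choice n = 5 are illustrative assumptions, not part of the text) that enumerates all of {0,1}^n and computes exact probabilities as fractions.

```python
from fractions import Fraction
from itertools import product

def prob(event, n):
    """Exact probability of `event` (a predicate on bit tuples) under n fair coins."""
    outcomes = list(product((0, 1), repeat=n))
    return Fraction(sum(1 for x in outcomes if event(x)), len(outcomes))

n = 5  # illustrative choice; any n > 1 behaves the same way

even = lambda x: sum(x) % 2 == 0        # event A: even number of ones
first_is_one = lambda x: x[0] == 1      # event B: x_0 = 1

p_even = prob(even, n)
p_both = prob(lambda x: even(x) and first_is_one(x), n)   # Pr[A and B]
p_not_even = prob(lambda x: not even(x), n)               # Pr[complement of A]

print(p_even, p_both, p_not_even)
```

Since every outcome has the same weight, a probability here is just a count divided by 2^n, which is why exact fractions suffice.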


18.1.1 Random variables 

Events correspond to Yes/No questions, but often we want to analyze 
finer questions. For example, if we make a bet at the roulette wheel, 
we don’t want to just analyze whether we won or lost, but also how 
much we've gained. A (real valued) random variable is simply a way 
to associate a number with the result of a probabilistic experiment. 


Formally, a random variable is a function $X : \{0,1\}^n \rightarrow \mathbb{R}$
that maps every outcome $x \in \{0,1\}^n$ to an element
$X(x) \in \mathbb{R}$. For example, the function
$SUM : \{0,1\}^n \rightarrow \mathbb{R}$ that maps $x$ to the sum of its
coordinates (i.e., to $\sum_{i=0}^{n-1} x_i$) is a random variable.

The expectation of a random variable $X$, denoted by $\mathbb{E}[X]$, is
the average value that this number takes, taken over all draws from the
probabilistic experiment. In other words, the expectation of $X$ is
defined as follows:

$$\mathbb{E}[X] = \sum_{x \in \{0,1\}^n} 2^{-n} X(x).$$

If $X$ and $Y$ are random variables, then we can define $X + Y$ as
simply the random variable that maps a point $x \in \{0,1\}^n$ to
$X(x) + Y(x)$. One basic and very useful property of the expectation is
that it is linear:


Lemma 18.3 — Linearity of expectation.

$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$


Proof.

$$\mathbb{E}[X + Y] = \sum_{x \in \{0,1\}^n} 2^{-n}\left(X(x) + Y(x)\right) = \sum_{x \in \{0,1\}^n} 2^{-n} X(x) + \sum_{x \in \{0,1\}^n} 2^{-n} Y(x) = \mathbb{E}[X] + \mathbb{E}[Y].$$


Similarly, $\mathbb{E}[kX] = k\,\mathbb{E}[X]$ for every $k \in \mathbb{R}$.


Solved Exercise 18.1 — Expectation of sum. Let
$X : \{0,1\}^n \rightarrow \mathbb{R}$ be the random variable that maps
$x \in \{0,1\}^n$ to $x_0 + x_1 + \cdots + x_{n-1}$. Prove that
$\mathbb{E}[X] = n/2$.


Solution:

We can solve this using the linearity of expectation. We can define
random variables $X_0, X_1, \ldots, X_{n-1}$ such that $X_i(x) = x_i$.
Since each $x_i$ equals 1 with probability 1/2 and 0 with probability
1/2, $\mathbb{E}[X_i] = 1/2$. Since $X = \sum_{i=0}^{n-1} X_i$, by the
linearity of expectation

$$\mathbb{E}[X] = \mathbb{E}[X_0] + \mathbb{E}[X_1] + \cdots + \mathbb{E}[X_{n-1}] = \frac{n}{2}.$$
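
As a sanity check on the solved exercise, here is a small sketch that evaluates the definition of expectation directly, summing 2^{-n} X(x) over every x. The function name `expectation` and the value n = 6 are assumptions made for illustration only.

```python
from fractions import Fraction
from itertools import product

def expectation(X, n):
    """E[X] = sum over x in {0,1}^n of 2^{-n} * X(x), computed exactly."""
    return sum(Fraction(X(x), 2**n) for x in product((0, 1), repeat=n))

n = 6  # illustrative choice
total = lambda x: sum(x)                              # the random variable SUM
coords = [lambda x, i=i: x[i] for i in range(n)]      # X_i(x) = x_i

e_total = expectation(total, n)
e_coords = [expectation(Xi, n) for Xi in coords]

# Linearity: the expectation of the sum equals the sum of the expectations.
print(e_total, sum(e_coords))
```

The brute-force sum visits 2^n outcomes, while the linearity argument needs only n one-coin computations, which is the practical value of the lemma.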


If $A$ is an event, then $1_A$ is the random variable such that
$1_A(x)$ equals 1 if $x \in A$, and $1_A(x) = 0$ otherwise. Note that
$\Pr[A] = \mathbb{E}[1_A]$ (can you see why?). Using this and the
linearity of expectation, we can show one of the most useful bounds in
probability theory:


Lemma 18.4 — Union bound. For every two events $A, B$,
$\Pr[A \cup B] \leq \Pr[A] + \Pr[B]$.


Proof of Lemma 18.4. For every $x$, the variable
$1_{A \cup B}(x) \leq 1_A(x) + 1_B(x)$. Hence,
$\Pr[A \cup B] = \mathbb{E}[1_{A \cup B}] \leq \mathbb{E}[1_A + 1_B] = \mathbb{E}[1_A] + \mathbb{E}[1_B] = \Pr[A] + \Pr[B]$.
∎


The way we often use this in theoretical computer science is to
argue that, for example, if there is a list of 100 bad events that can
happen, and each one of them happens with probability at most 1/10000,
then with probability at least 1 − 100/10000 = 0.99, no bad event
happens.
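
To see the union bound in action, the sketch below checks it exhaustively over every pair of events on the two-coin sample space, and also reproduces the "100 bad events" arithmetic. The variable names are illustrative assumptions; the check itself is just the statement of Lemma 18.4.

```python
from fractions import Fraction
from itertools import combinations, product

outcomes = list(product((0, 1), repeat=2))   # sample space {0,1}^2

def pr(event):
    """Probability of an event given as a set of outcomes."""
    return Fraction(len(event), len(outcomes))

# Every event is a subset of the sample space: 2^4 = 16 events in total.
events = [frozenset(c) for r in range(len(outcomes) + 1)
          for c in combinations(outcomes, r)]

# Pairs violating the union bound (there should be none).
violations = [(A, B) for A in events for B in events
              if pr(A | B) > pr(A) + pr(B)]

# Pairs where the bound holds with equality.
tight = [(A, B) for A in events for B in events
         if pr(A | B) == pr(A) + pr(B)]

# The "100 bad events" calculation from the text.
no_bad_event_lower_bound = 1 - 100 * Fraction(1, 10000)
print(len(violations), no_bad_event_lower_bound)
```

In this uniform setting the bound is tight exactly when the two events are disjoint.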


18.1.2 Distributions over strings 

While most of the time we think of random variables as having
as output a real number, we sometimes consider random variables
whose output is a string. That is, we can think of a map
$Y : \{0,1\}^n \rightarrow \{0,1\}^*$ and consider the “random variable”
$Y$ such that for every $y \in \{0,1\}^*$, the probability that $Y$
outputs $y$ is equal to
$\frac{1}{2^n}\left|\{x \in \{0,1\}^n \mid Y(x) = y\}\right|$. To avoid
confusion, we will typically refer to such string-valued random
variables as distributions over strings. So, a distribution $Y$ over
strings $\{0,1\}^*$ can be thought of as a finite collection of strings
$y_0, \ldots, y_{M-1} \in \{0,1\}^*$ and probabilities
$p_0, \ldots, p_{M-1}$ (which are non-negative numbers summing up to
one), so that $\Pr[Y = y_i] = p_i$.

Two distributions $Y$ and $Y'$ are identical if they assign the same
probability to every string. For example, consider the following two
functions $Y, Y' : \{0,1\}^2 \rightarrow \{0,1\}^2$. For every
$x \in \{0,1\}^2$, we define $Y(x) = x$ and
$Y'(x) = x_0(x_0 \oplus x_1)$ where $\oplus$ is the XOR operation.
Although these are two different functions, they induce the same
distribution over $\{0,1\}^2$ when invoked on a uniform input. The
distribution $Y(x)$ for $x \sim \{0,1\}^2$ is of course the uniform
distribution over $\{0,1\}^2$. On the other hand $Y'$ is simply the map
$00 \mapsto 00$, $01 \mapsto 01$, $10 \mapsto 11$, $11 \mapsto 10$, which
permutes the outputs of $Y$ and hence is also uniform over $\{0,1\}^2$.
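
The example of two different maps inducing the same distribution can be verified mechanically. This is a minimal sketch, with the helper name `distribution` assumed for illustration, that tabulates the output distribution of a string-valued function on a uniform input.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def distribution(f, n):
    """Distribution of f(x) for x uniform over {0,1}^n, as {output: probability}."""
    counts = Counter(f(x) for x in product((0, 1), repeat=n))
    return {y: Fraction(c, 2**n) for y, c in counts.items()}

Y = lambda x: f"{x[0]}{x[1]}"                  # Y(x) = x
Y_prime = lambda x: f"{x[0]}{x[0] ^ x[1]}"     # Y'(x) = x_0 (x_0 XOR x_1)

dist_Y = distribution(Y, 2)
dist_Y_prime = distribution(Y_prime, 2)

print(dist_Y == dist_Y_prime)   # same distribution, although the functions differ
```

Comparing the two dictionaries is exactly the definition of identical distributions: the same probability for every string.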


18.1.3 More general sample spaces 

While throughout most of this book we assume that the underlying
probabilistic experiment corresponds to tossing $n$ independent coins,
all the claims we make easily generalize to sampling $x$ from a more
general finite or countable set $S$ (and not-so-easily generalize to
uncountable sets $S$ as well). A probability distribution over a finite
set $S$ is simply a function $\mu : S \rightarrow [0,1]$ such that
$\sum_{x \in S} \mu(x) = 1$. We think of this as the experiment where we
obtain every $x \in S$ with probability $\mu(x)$, and sometimes denote
this as $x \sim \mu$. In particular, tossing $n$ random coins
corresponds to the probability distribution
$\mu : \{0,1\}^n \rightarrow [0,1]$ defined as $\mu(x) = 2^{-n}$ for
every $x \in \{0,1\}^n$. An event $A$ is a subset of $S$, and the
probability of $A$, which we denote by $\Pr_\mu[A]$, is
$\sum_{x \in A} \mu(x)$. A random variable is a function
$X : S \rightarrow \mathbb{R}$, where the probability that $X = y$ is
equal to $\sum_{x \in S \text{ s.t. } X(x) = y} \mu(x)$.


Figure 18.3: The union bound tells us that the probability
of $A$ or $B$ happening is at most the sum of the
individual probabilities. We can see it by noting that
for every two sets $|A \cup B| \leq |A| + |B|$ (with equality
only if $A$ and $B$ have no intersection).


18.2 CORRELATIONS AND INDEPENDENCE 


One of the most delicate but important concepts in probability is the 
notion of independence (and the opposing notion of correlations). Subtle 
correlations are often behind surprises and errors in probability and 
statistical analysis, and several mistaken predictions have been blamed 
on miscalculating the correlations between, say, housing prices in 
Florida and Arizona, or voter preferences in Ohio and Michigan. See 
also Joe Blitzstein’s aptly named talk “Conditioning is the Soul of 
Statistics”. (Another thorny issue is of course the difference between 
correlation and causation. Luckily, this is another point we don’t need to 
worry about in our clean setting of tossing n coins.) 

Two events $A$ and $B$ are independent if the fact that $A$ happens
makes $B$ neither more nor less likely to happen. For example, if we
think of the experiment of tossing 3 random coins $x \in \{0,1\}^3$, and
we let $A$ be the event that $x_0 = 1$ and $B$ the event that
$x_0 + x_1 + x_2 \geq 2$, then if $A$ happens it is more likely that $B$
happens, and hence these events are not independent. On the other hand,
if we let $C$ be the event that $x_1 = 1$, then because the second coin
toss is not affected by the result of the first one, the events $A$ and
$C$ are independent.

The formal definition is that events $A$ and $B$ are independent if
$\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. If
$\Pr[A \cap B] > \Pr[A] \cdot \Pr[B]$ then we say that $A$ and $B$ are
positively correlated, while if $\Pr[A \cap B] < \Pr[A] \cdot \Pr[B]$
then we say that $A$ and $B$ are negatively correlated (see Fig. 18.4).

If we consider the above examples on the experiment of choosing
$x \in \{0,1\}^3$ then we can see that

$$\Pr[x_0 = 1] = \tfrac{1}{2}$$
$$\Pr[x_0 + x_1 + x_2 \geq 2] = \Pr[\{011, 101, 110, 111\}] = \tfrac{4}{8} = \tfrac{1}{2}$$

but

$$\Pr[x_0 = 1 \wedge x_0 + x_1 + x_2 \geq 2] = \Pr[\{101, 110, 111\}] = \tfrac{3}{8} > \tfrac{1}{2} \cdot \tfrac{1}{2}$$

and hence, as we already observed, the events $\{x_0 = 1\}$ and
$\{x_0 + x_1 + x_2 \geq 2\}$ are not independent and in fact are
positively correlated. On the other hand,
$\Pr[x_0 = 1 \wedge x_1 = 1] = \Pr[\{110, 111\}] = \tfrac{2}{8} = \tfrac{1}{2} \cdot \tfrac{1}{2}$
and hence the events $\{x_0 = 1\}$ and $\{x_1 = 1\}$ are indeed
independent.
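
The three-coin computations above can be reproduced by enumeration. The sketch below (the helper `correlation` and its string labels are assumptions introduced for illustration) compares Pr[A ∧ B] with Pr[A]·Pr[B] exactly.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product((0, 1), repeat=3))   # sample space {0,1}^3

def pr(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for x in outcomes if event(x)), len(outcomes))

def correlation(A, B):
    """Classify a pair of events by comparing Pr[A and B] with Pr[A] * Pr[B]."""
    joint = pr(lambda x: A(x) and B(x))
    product_of_marginals = pr(A) * pr(B)
    if joint == product_of_marginals:
        return "independent"
    return "positive" if joint > product_of_marginals else "negative"

A = lambda x: x[0] == 1        # first coin is 1
B = lambda x: sum(x) >= 2      # at least two ones
C = lambda x: x[1] == 1        # second coin is 1

print(correlation(A, B), correlation(A, C))
```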


Figure 18.4: Two events $A$ and $B$ are independent if
$\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. In the two figures above,
the empty $x \times x$ square is the sample space, and $A$
and $B$ are two events in this sample space. In the left
figure, $A$ and $B$ are independent, while in the right
figure they are negatively correlated, since $B$ is less
likely to occur if we condition on $A$ (and vice versa).
Mathematically, one can see this by noticing that in
the left figure the areas of $A$ and $B$ respectively are
$a \cdot x$ and $b \cdot x$, and so their probabilities are
$\frac{a \cdot x}{x^2} = \frac{a}{x}$ and
$\frac{b \cdot x}{x^2} = \frac{b}{x}$ respectively, while the area of
$A \cap B$ is $a \cdot b$ which corresponds to the probability
$\frac{a \cdot b}{x^2}$. In the right figure, the area of the triangle
$B$ is $\frac{b \cdot x}{2}$ which corresponds to a probability of
$\frac{b}{2x}$, but the area of $A \cap B$ is $\frac{b' \cdot a}{2}$ for
some $b' < b$. This means that the probability of $A \cap B$ is
$\frac{b' \cdot a}{2x^2} < \frac{b}{2x} \cdot \frac{a}{x}$, or in other
words $\Pr[A \cap B] < \Pr[A] \cdot \Pr[B]$.


Conditional probability: If $A$ and $B$ are events, and $A$ happens with
non-zero probability then we define the probability that $B$ happens
conditioned on $A$ to be $\Pr[B|A] = \Pr[A \cap B] / \Pr[A]$. This
corresponds to calculating the probability that $B$ happens if we
already know that $A$ happened. Note that $A$ and $B$ are independent if
and only if $\Pr[B|A] = \Pr[B]$.


More than two events: We can generalize this definition to more than
two events. We say that events $A_1, \ldots, A_k$ are mutually
independent if knowing that any set of them occurred or didn’t occur
does not change the probability that an event outside the set occurs.
Formally, the condition is that for every subset $I \subseteq [k]$,

$$\Pr\left[\wedge_{i \in I} A_i\right] = \prod_{i \in I} \Pr[A_i].$$

For example, if $x \sim \{0,1\}^3$, then the events $\{x_0 = 1\}$,
$\{x_1 = 1\}$ and $\{x_2 = 1\}$ are mutually independent. On the other
hand, the events $\{x_0 = 1\}$, $\{x_1 = 1\}$ and
$\{x_0 + x_1 = 0 \bmod 2\}$ are not mutually independent, even though
every pair of these events is independent
(can you see why? see also Fig. 18.5).
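
The gap between pairwise and mutual independence is easy to test by enumeration. The following sketch (function and variable names are illustrative assumptions) checks the product condition for every subset of a list of events over three coins.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

outcomes = list(product((0, 1), repeat=3))   # sample space {0,1}^3

def pr(event):
    """Exact probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for x in outcomes if event(x)), len(outcomes))

def mutually_independent(events):
    """Check Pr[AND of A_i, i in I] = product of Pr[A_i] for every subset I of size >= 2."""
    for r in range(2, len(events) + 1):
        for subset in combinations(events, r):
            joint = pr(lambda x: all(A(x) for A in subset))
            if joint != prod(pr(A) for A in subset):
                return False
    return True

A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
C = lambda x: x[2] == 1
E = lambda x: (x[0] + x[1]) % 2 == 0

print(mutually_independent([A, B, C]))   # three separate coins
print(mutually_independent([A, B, E]))   # pairwise independent only
```

For A, B, E the condition fails only on the full triple: the intersection has probability 1/4 rather than 1/8.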


18.2.1 Independent random variables 

We say that two random variables $X : \{0,1\}^n \rightarrow \mathbb{R}$
and $Y : \{0,1\}^n \rightarrow \mathbb{R}$ are independent if for every
$u, v \in \mathbb{R}$, the events $\{X = u\}$ and $\{Y = v\}$ are
independent. (We use $\{X = u\}$ as shorthand for $\{x \mid X(x) = u\}$.)
In other words, $X$ and $Y$ are independent if
$\Pr[X = u \wedge Y = v] = \Pr[X = u] \Pr[Y = v]$ for every
$u, v \in \mathbb{R}$. For example, if two random variables depend on
the result of tossing different coins then they are independent:


Lemma 18.6 Suppose that $S = \{s_0, \ldots, s_{k-1}\}$ and
$T = \{t_0, \ldots, t_{m-1}\}$ are disjoint subsets of
$\{0, \ldots, n-1\}$ and let $X, Y : \{0,1\}^n \rightarrow \mathbb{R}$ be
random variables such that $X = F(x_{s_0}, \ldots, x_{s_{k-1}})$ and
$Y = G(x_{t_0}, \ldots, x_{t_{m-1}})$ for some functions
$F : \{0,1\}^k \rightarrow \mathbb{R}$ and
$G : \{0,1\}^m \rightarrow \mathbb{R}$. Then $X$ and $Y$ are independent.


Proof of Lemma 18.6. Let $a, b \in \mathbb{R}$, and let
$A = \{x \in \{0,1\}^k : F(x) = a\}$ and
$B = \{x \in \{0,1\}^m : G(x) = b\}$. Since $S$ and $T$ are disjoint, we
can reorder the indices so that $S = \{0, \ldots, k-1\}$ and
$T = \{k, \ldots, k+m-1\}$ without affecting any of the probabilities.
Hence we can write $\Pr[X = a \wedge Y = b] = |C|/2^n$ where
$C = \{x_0, \ldots, x_{n-1} : (x_0, \ldots, x_{k-1}) \in A \wedge (x_k, \ldots, x_{k+m-1}) \in B\}$.
Another way to write this using string concatenation is that
$C = \{xyz : x \in A, y \in B, z \in \{0,1\}^{n-k-m}\}$, and hence
$|C| = |A||B|2^{n-k-m}$, which means that

$$\frac{|C|}{2^n} = \frac{|A|}{2^k} \cdot \frac{|B|}{2^m} \cdot \frac{2^{n-k-m}}{2^{n-k-m}} = \Pr[X = a] \Pr[Y = b].$$


Figure 18.5: Consider the sample space $\{0,1\}^3$ and
the events $A, B, C, D, E$ corresponding to $A$: $x_0 = 1$,
$B$: $x_1 = 1$, $C$: $x_0 + x_1 + x_2 \geq 2$,
$D$: $x_0 + x_1 + x_2 = 0 \bmod 2$ and $E$: $x_0 + x_1 = 0 \bmod 2$.
We can see that $A$ and $B$ are independent, $C$ is positively
correlated with $A$ and positively correlated with $B$, the three
events $A, B, D$ are mutually independent, and while
every pair out of $A, B, E$ is independent, the three
events $A, B, E$ are not mutually independent since
their intersection has probability $\frac{2}{8} = \frac{1}{4}$ instead of
$\frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}$.

If $X$ and $Y$ are independent random variables then (letting $S_X, S_Y$
denote the sets of all numbers that have positive probability of being
the output of $X$ and $Y$, respectively):

$$\mathbb{E}[XY] = \sum_{a \in S_X, b \in S_Y} \Pr[X = a \wedge Y = b] \cdot ab \;=^{(1)} \sum_{a \in S_X, b \in S_Y} \Pr[X = a] \Pr[Y = b] \cdot ab \;=^{(2)}$$
$$\left(\sum_{a \in S_X} \Pr[X = a] \cdot a\right)\left(\sum_{b \in S_Y} \Pr[Y = b] \cdot b\right) \;=^{(3)} \mathbb{E}[X]\,\mathbb{E}[Y]$$

where the first equality ($=^{(1)}$) follows from the independence of $X$
and $Y$, the second equality ($=^{(2)}$) follows by “opening the
parentheses” of the right-hand side, and the third equality ($=^{(3)}$)
follows from the definition of expectation. (This is not an “if and only
if”; see Exercise 18.3.)
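
As a concrete instance of the product rule, the sketch below takes two random variables that depend on disjoint sets of coins (so they are independent by Lemma 18.6) and checks E[XY] = E[X]·E[Y] exactly. The particular functions X and Y are arbitrary illustrative choices.

```python
from fractions import Fraction
from itertools import product

n = 4
outcomes = list(product((0, 1), repeat=n))   # sample space {0,1}^4

def expectation(X):
    """Exact expectation of X under n fair coins."""
    return sum(Fraction(X(x), len(outcomes)) for x in outcomes)

X = lambda x: x[0] + 2 * x[1]     # depends only on coins 0 and 1
Y = lambda x: 3 * x[2] - x[3]     # depends only on coins 2 and 3

e_x = expectation(X)
e_y = expectation(Y)
e_xy = expectation(lambda x: X(x) * Y(x))

print(e_x, e_y, e_xy)
```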

Another useful fact is that if $X$ and $Y$ are independent random
variables, then so are $F(X)$ and $G(Y)$ for all functions
$F, G : \mathbb{R} \rightarrow \mathbb{R}$. This is intuitively true
since learning $F(X)$ can only provide us with less information than
does learning $X$ itself. Hence, if learning $X$ does not teach us
anything about $Y$ (and so also about $G(Y)$) then neither will learning
$F(X)$. Indeed, to prove this we can write for every
$a, b \in \mathbb{R}$:

$$\Pr[F(X) = a \wedge G(Y) = b] = \sum_{x \text{ s.t. } F(x) = a,\; y \text{ s.t. } G(y) = b} \Pr[X = x \wedge Y = y] =$$
$$\sum_{x \text{ s.t. } F(x) = a,\; y \text{ s.t. } G(y) = b} \Pr[X = x] \Pr[Y = y] =$$
$$\left(\sum_{x \text{ s.t. } F(x) = a} \Pr[X = x]\right) \cdot \left(\sum_{y \text{ s.t. } G(y) = b} \Pr[Y = y]\right) =$$
$$\Pr[F(X) = a] \Pr[G(Y) = b].$$


18.2.2 Collections of independent random variables 

We can extend the notions of independence to more than two random
variables: we say that the random variables $X_0, \ldots, X_{n-1}$ are
mutually independent if for every $a_0, \ldots, a_{n-1} \in \mathbb{R}$,

$$\Pr[X_0 = a_0 \wedge \cdots \wedge X_{n-1} = a_{n-1}] = \Pr[X_0 = a_0] \cdots \Pr[X_{n-1} = a_{n-1}].$$

And similarly, we have that

Lemma 18.7 — Expectation of product of independent random variables. If
$X_0, \ldots, X_{n-1}$ are mutually independent then

$$\mathbb{E}\left[\prod_{i=0}^{n-1} X_i\right] = \prod_{i=0}^{n-1} \mathbb{E}[X_i].$$


Lemma 18.8 — Functions preserve independence. If $X_0, \ldots, X_{n-1}$
are mutually independent, and $Y_0, \ldots, Y_{n-1}$ are defined as
$Y_i = F_i(X_i)$ for some functions
$F_0, \ldots, F_{n-1} : \mathbb{R} \rightarrow \mathbb{R}$, then
$Y_0, \ldots, Y_{n-1}$ are mutually independent as well.


18.3 CONCENTRATION AND TAIL BOUNDS 


The name “expectation” is somewhat misleading. For example, suppose
that you and I place a bet on the outcome of 10 coin tosses, where
if they all come out to be 1’s then I pay you 100,000 dollars and
otherwise you pay me 10 dollars. If we let
$X : \{0,1\}^{10} \rightarrow \mathbb{R}$ be the random variable
denoting your gain, then we see that

$$\mathbb{E}[X] = 2^{-10} \cdot 100000 - (1 - 2^{-10}) \cdot 10 \approx 90.$$

But we don’t really “expect” the result of this experiment to be for
you to gain 90 dollars. Rather, 99.9% of the time you will pay me 10
dollars, and you will hit the jackpot 0.1% of the time.
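
For readers who want the arithmetic spelled out, here is a short exact computation of the bet’s expectation. The variable names are illustrative; the exact value is about 87.67, which the text rounds to roughly 90.

```python
from fractions import Fraction

p_jackpot = Fraction(1, 2**10)     # probability that all 10 coins come out 1
expected_gain = p_jackpot * 100_000 - (1 - p_jackpot) * 10
p_lose = 1 - p_jackpot             # probability that you pay 10 dollars

print(float(expected_gain), float(p_lose))
```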

However, if we repeat this experiment again and again (with fresh
and hence independent coins), then in the long run we do expect your
average earning to be close to 90 dollars, which is the reason why
casinos can make money in a predictable way even though every
individual bet is random. For example, if we toss $n$ independent and
unbiased coins, then as $n$ grows, the number of coins that come up
ones will be more and more concentrated around $n/2$ according to the
famous “bell curve” (see Fig. 18.6).

Much of probability theory is concerned with so called concentration 
or tail bounds, which are upper bounds on the probability that a ran- 
dom variable X deviates too much from its expectation. The first and 
simplest one of them is Markov’s inequality: 


Theorem 18.9 — Markov’s inequality. If $X$ is a non-negative random
variable then for every $k > 1$,
$\Pr[X \geq k\,\mathbb{E}[X]] \leq 1/k$.


Proof of Theorem 18.9. Let $\mu = \mathbb{E}[X]$ and define
$Y = 1_{X \geq k\mu}$. That is, $Y(x) = 1$ if $X(x) \geq k\mu$ and
$Y(x) = 0$ otherwise. Note that by definition, for every $x$,
$Y(x) \leq X(x)/(k\mu)$. We need to show $\mathbb{E}[Y] \leq 1/k$.
But this follows since
$\mathbb{E}[Y] \leq \mathbb{E}[X/(k\mu)] = \mathbb{E}[X]/(k\mu) = \mu/(k\mu) = 1/k$.
∎
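
Markov’s inequality can be compared against exact tail probabilities on a small example. The sketch below uses the number of ones among n = 8 fair coins as the non-negative random variable; the helper `tail` and the chosen values of k are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

n = 8
outcomes = list(product((0, 1), repeat=n))
X = lambda x: sum(x)                       # a non-negative random variable
mu = Fraction(sum(X(x) for x in outcomes), len(outcomes))   # E[X]

def tail(threshold):
    """Exact Pr[X >= threshold]."""
    return Fraction(sum(1 for x in outcomes if X(x) >= threshold), len(outcomes))

# For each k, pair the exact tail Pr[X >= k*mu] with Markov's bound 1/k.
checks = {k: (tail(k * mu), Fraction(1, k)) for k in (Fraction(3, 2), 2, 4)}

for k, (exact, bound) in checks.items():
    print(k, exact, bound)
```

The exact tails are far smaller than 1/k here, which illustrates that Markov’s inequality uses only the expectation and is usually loose for concentrated variables.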


Figure 18.6: The probabilities that we obtain a particular
sum when we toss $n = 10, 20, 100, 1000$ coins
converge quickly to the Gaussian/normal distribution.


Figure 18.7: Markov’s Inequality tells us that a non-negative
random variable $X$ cannot be much larger
than its expectation, with high probability. For example,
if the expectation of $X$ is $\mu$, then the probability
that $X \geq 4\mu$ must be at most 1/4, as otherwise just
the contribution from this part of the sample space
will be too large.


The averaging principle. While the expectation of a random variable
$X$ is hardly always the “typical value”, we can show that $X$ is
guaranteed to achieve a value that is at least its expectation with
positive probability. For example, if the average grade in an exam is 87
points, at least one student got a grade 87 or more on the exam. This is
known as the averaging principle, and despite its simplicity it is
surprisingly useful.


Lemma 18.10 Let $X$ be a random variable, then
$\Pr[X \geq \mathbb{E}[X]] > 0$.


Proof. Suppose towards a contradiction that
$\Pr[X < \mathbb{E}[X]] = 1$. Then the random variable
$Y = \mathbb{E}[X] - X$ is always positive. By linearity of expectation
$\mathbb{E}[Y] = \mathbb{E}[X] - \mathbb{E}[X] = 0$. Yet by Markov, a
non-negative random variable $Y$ with $\mathbb{E}[Y] = 0$ must equal 0
with probability 1, since the probability that $Y > k \cdot 0 = 0$ is at
most $1/k$ for every $k > 1$. Hence we get a contradiction to the
assumption that $Y$ is always positive.
∎


18.3.1 Chebyshev’s Inequality 

Markov’s inequality says that a (non-negative) random variable $X$
can’t go too crazy and be, say, a million times its expectation, with
significant probability. But ideally we would like to say that with
high probability, $X$ should be very close to its expectation, e.g., in
the range $[0.99\mu, 1.01\mu]$ where $\mu = \mathbb{E}[X]$. In such a
case we say that $X$ is concentrated, and hence its expectation (i.e.,
mean) will be close to its median and other ways of measuring $X$’s
“typical value”. Chebyshev’s inequality can be thought of as saying that
$X$ is concentrated if it has a small standard deviation.

A standard way to measure the deviation of a random variable
from its expectation is by using its standard deviation. For a random
variable $X$, we define the variance of $X$ as
$\mathrm{Var}[X] = \mathbb{E}[(X - \mu)^2]$ where $\mu = \mathbb{E}[X]$;
i.e., the variance is the average squared distance of $X$ from its
expectation. The standard deviation of $X$ is defined as
$\sigma[X] = \sqrt{\mathrm{Var}[X]}$. (This is well-defined since the
variance, being an average of a square, is always a non-negative
number.)

Using Chebyshev’s inequality, we can control the probability that 
a random variable is too many standard deviations away from its 
expectation. 


Theorem 18.11 — Chebyshev’s inequality. Suppose that
$\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathrm{Var}[X]$. Then for every
$k > 0$, $\Pr[|X - \mu| \geq k\sigma] \leq 1/k^2$.


Proof. The proof follows from Markov’s inequality. We define the
random variable $Y = (X - \mu)^2$. Then
$\mathbb{E}[Y] = \mathrm{Var}[X] = \sigma^2$, and hence by Markov the
probability that $Y \geq k^2\sigma^2$ is at most $1/k^2$. But clearly
$(X - \mu)^2 \geq k^2\sigma^2$ if and only if $|X - \mu| \geq k\sigma$.


One example of how to use Chebyshev’s inequality is the setting
when $X = X_1 + \cdots + X_n$ where the $X_i$’s are independent and
identically distributed (i.i.d for short) variables with values in
$[0,1]$ where each has expectation 1/2. Since
$\mathbb{E}[X] = \sum_i \mathbb{E}[X_i] = n/2$, we would like to say
that $X$ is very likely to be in, say, the interval
$[0.499n, 0.501n]$. Using Markov’s inequality directly will not help us,
since it will only tell us that $X$ is very likely to be at most $100n$
(which we already knew, since it always lies between 0 and $n$).
However, since $X_1, \ldots, X_n$ are independent,

$$\mathrm{Var}[X_1 + \cdots + X_n] = \mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n]. \qquad (18.1)$$

(We leave showing this to the reader as Exercise 18.8.)

For every random variable $X_i$ in $[0,1]$, $\mathrm{Var}[X_i] \leq 1$
(if the variable is always in $[0,1]$, it can’t be more than 1 away from
its expectation), and hence (18.1) implies that
$\mathrm{Var}[X] \leq n$ and hence $\sigma[X] \leq \sqrt{n}$. For
large $n$, $\sqrt{n} \ll 0.001n$, and in particular if
$\sqrt{n} \leq 0.001n/k$, we can use Chebyshev’s inequality to bound the
probability that $X$ is not in $[0.499n, 0.501n]$ by $1/k^2$.
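
To get a feel for how Chebyshev’s bound compares with the truth, the sketch below computes the exact deviation probability for the number of ones among n fair coins and sets it beside 1/k². It assumes the special case of {0,1}-valued fair coins, where each variance is exactly 1/4 (sharper than the bound Var ≤ 1 used in the text); the helper name `deviation_prob` and n = 1000 are illustrative.

```python
from fractions import Fraction
from math import comb, sqrt

def deviation_prob(n, t):
    """Exact Pr[|X - n/2| >= t] where X is the number of ones among n fair coins."""
    hits = sum(comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= 2 * t)
    return Fraction(hits, 2**n)

n = 1000
sigma = sqrt(n) / 2          # standard deviation: each fair coin has variance 1/4

results = []
for k in (2, 5, 10):
    exact = float(deviation_prob(n, k * sigma))
    results.append((k, exact, 1 / k**2))     # (k, exact tail, Chebyshev bound)

for row in results:
    print(row)
```

The exact probabilities drop much faster than 1/k² as k grows, which is the phenomenon that sharper tail bounds for sums of independent variables capture.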


18.3.2 The Chernoff bound 

Chebyshev’s inequality already shows a connection between independence
and concentration, but in many cases we can hope for
a quantitatively much stronger result. If, as in the example above,
$X = X_1 + \cdots + X_n$ where the $X_i$’s are bounded i.i.d random
variables of mean 1/2, then as $n$ grows, the distribution of $X$ would
be roughly the normal or Gaussian distribution, that is, distributed
according to the bell curve (see Fig. 18.6 and Fig. 18.8). This
distribution has the property of being very concentrated in the sense
that the probability of deviating $k$ standard deviations from the mean
is not merely $1/k^2$ as is guaranteed by Chebyshev, but rather is
roughly $e^{-k^2}$. Specifically, for a normal random variable $X$ of
expectation $\mu$ and standard deviation $\sigma$, the probability that
$|X - \mu| \geq k\sigma$ is at most $2e^{-k^2/2}$. That is, we have an
exponential decay of the probability of deviation.

The following extremely useful theorem shows that such exponential
decay occurs every time we have a sum of independent and
bounded variables. This theorem is known under many names in different
communities, though it is mostly called the Chernoff bound in
the computer science literature:


Theorem 18.12 — Chernoff/Hoeffding bound. If $X_0, \ldots, X_{n-1}$ are
i.i.d random variables such that $X_i \in [0,1]$ and
$\mathbb{E}[X_i] = p$ for every $i$, then for every $\epsilon > 0$

$$\Pr\left[\left|\sum_{i=0}^{n-1} X_i - pn\right| > \epsilon \cdot n\right] \leq 2 \cdot \exp(-2\epsilon^2 n). \qquad (18.2)$$


We omit the proof, which appears in many texts, and uses Markov’s
inequality on i.i.d random variables $Y_0, \ldots, Y_{n-1}$ that are of
the form $Y_i = e^{\lambda X_i}$ for some carefully chosen parameter
$\lambda$. See Exercise 18.11 for a proof of the simple (but highly
useful and representative) case where each $X_i$ is $\{0,1\}$ valued and
$p = 1/2$. (See also Exercise 18.12 for a generalization.)


Figure 18.8: In the normal distribution or the bell curve,
the probability of deviating $k$ standard deviations
from the expectation shrinks exponentially in $k^2$, and
specifically with probability at least $1 - 2e^{-k^2/2}$ a
random variable $X$ of expectation $\mu$ and standard
deviation $\sigma$ satisfies
$\mu - k\sigma \leq X \leq \mu + k\sigma$. This figure
gives more precise bounds for $k = 1, 2, 3, 4, 5, 6$.
(Image credit: Imran Baghirov)



18.3.3 Application: Supervised learning and empirical risk minimization 
Here is a nice application of the Chernoff bound. Consider the task
of supervised learning. You are given a set $S$ of $n$ samples of the
form $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ drawn from some unknown
distribution $D$ over pairs $(x, y)$. For simplicity we will assume that
$x_i \in \{0,1\}^m$ and $y_i \in \{0,1\}$. (We use here the concept of
general distribution over the finite set $\{0,1\}^{m+1}$ as discussed in
Section 18.1.3.) The goal is to find a classifier
$h : \{0,1\}^m \rightarrow \{0,1\}$ that will minimize the test error,
which is the probability $L(h)$ that $h(x) \neq y$ where $(x, y)$ is
drawn from the distribution $D$. That is,
$L(h) = \Pr_{(x,y) \sim D}[h(x) \neq y]$.

One way to find such a classifier is to consider a collection
$\mathcal{C}$ of potential classifiers and look at the classifier $h$ in
$\mathcal{C}$ that does best on the training set $S$. The classifier $h$
is known as the empirical risk minimizer (see also Section 12.1.6). The
Chernoff bound can be used to show that as long as the number $n$ of
samples is sufficiently larger than the logarithm of $|\mathcal{C}|$,
the test error $L(h)$ will be close to its training error
$\hat{L}_S(h)$, which is defined as the fraction of pairs
$(x_i, y_i) \in S$ that it fails to classify. (Equivalently,
$\hat{L}_S(h) = \frac{1}{n} \sum_{i \in [n]} |h(x_i) - y_i|$.)

Theorem 18.14 — Generalization of ERM. Let $D$ be any distribution over
pairs $(x, y) \in \{0,1\}^{m+1}$ and $\mathcal{C}$ be any set of
functions mapping $\{0,1\}^m$ to $\{0,1\}$. Then for every
$\epsilon, \delta > 0$, if
$n > \frac{\log|\mathcal{C}| \log(1/\delta)}{\epsilon^2}$ and $S$ is a
set of $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ samples that are drawn
independently from $D$ then

$$\Pr_S\left[\forall_{h \in \mathcal{C}} \; |L(h) - \hat{L}_S(h)| \leq \epsilon\right] > 1 - \delta,$$

where the probability is taken over the choice of the set of samples
$S$.

In particular if $|\mathcal{C}| \leq 2^k$ and
$n > \frac{k \log(1/\delta)}{\epsilon^2}$ then with probability at
least $1 - \delta$, the classifier $h_* \in \mathcal{C}$ that minimizes
the empirical error $\hat{L}_S(h)$ over $h \in \mathcal{C}$ satisfies
$L(h_*) \leq \hat{L}_S(h_*) + \epsilon$, and hence its test error is at
most $\epsilon$ worse than its training error.


Proof Idea: 

The idea is to combine the Chernoff bound with the union bound.
Let $k = \log|\mathcal{C}|$. We first use the Chernoff bound to show
that for every fixed $h \in \mathcal{C}$, if we choose $S$ at random
then the probability that $|L(h) - \hat{L}_S(h)| > \epsilon$ will be
smaller than $\frac{\delta}{2^k}$. We can then use the union bound over
all the $2^k$ members of $\mathcal{C}$ to show that this will be the
case for every $h$.

* 


Proof of Theorem 18.14. Set $k = \log|\mathcal{C}|$ and so
$n > k\log(1/\delta)/\epsilon^2$. We start by making the following
claim.

CLAIM: For every $h \in \mathcal{C}$, the probability over $S$ that
$|L(h) - \hat{L}_S(h)| \geq \epsilon$ is smaller than $\delta/2^k$.

We prove the claim using the Chernoff bound. Specifically, for every
such $h$, let us define a collection of random variables
$X_0, \ldots, X_{n-1}$ as follows:

$$X_i = \begin{cases} 1 & h(x_i) \neq y_i \\ 0 & \text{otherwise} \end{cases}$$

Since the samples $(x_0, y_0), \ldots, (x_{n-1}, y_{n-1})$ are drawn
independently from the same distribution $D$, the random variables
$X_0, \ldots, X_{n-1}$ are independently and identically distributed.
Moreover, for every $i$, $\mathbb{E}[X_i] = L(h)$. Hence by the Chernoff
bound (see (18.2)), the probability that
$\left|\sum_{i=0}^{n-1} X_i - n \cdot L(h)\right| \geq \epsilon n$ is at
most $e^{-\epsilon^2 n} < e^{-k\log(1/\delta)} < \delta/2^k$ (using the
fact that $e > 2$). Since
$\hat{L}_S(h) = \frac{1}{n} \sum_{i \in [n]} X_i$, this completes the
proof of the claim.

Given the claim, the theorem follows from the union bound. Indeed,
for every $h \in \mathcal{C}$, define the “bad event” $B_h$ to be the
event (over the choice of $S$) that
$|L(h) - \hat{L}_S(h)| > \epsilon$. By the claim
$\Pr[B_h] < \delta/2^k$, and hence by the union bound the probability
that the union of $B_h$ for all $h \in \mathcal{C}$ happens is smaller
than $|\mathcal{C}|\delta/2^k = \delta$. If for every
$h \in \mathcal{C}$, $B_h$ does not happen, it means that for every
$h \in \mathcal{C}$, $|L(h) - \hat{L}_S(h)| \leq \epsilon$, and so the
probability of the latter event is larger than $1 - \delta$ which is
what we wanted to prove.
∎
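
The following simulation is a rough sketch of empirical risk minimization on a toy problem, meant only to illustrate the quantities in the theorem. Everything specific in it is a hypothetical choice made for illustration: the distribution D (x uniform over {0,1}^3, y equal to x_0 flipped with probability 0.1), the class of six single-bit classifiers, the values of ε and δ, and the use of base-2 logarithms in the sample-size formula (as suggested by the remark "e > 2" in the proof).

```python
import random
from itertools import product
from math import log2

def samples_needed(num_classifiers, eps, delta):
    """Smallest integer n with n > log|C| * log(1/delta) / eps^2 (logs base 2, an assumption)."""
    return int(log2(num_classifiers) * log2(1 / delta) / eps**2) + 1

m = 3
# Hypothetical class C: output bit j of x, or its negation (2m classifiers).
classifiers = [lambda x, j=j, flip=flip: x[j] ^ flip
               for j in range(m) for flip in (0, 1)]

def draw(rng):
    """Hypothetical D: x uniform over {0,1}^m, y = x_0 flipped with probability 0.1."""
    x = tuple(rng.randint(0, 1) for _ in range(m))
    y = x[0] ^ (1 if rng.random() < 0.1 else 0)
    return x, y

def true_error(h):
    """L(h), computed exactly by summing over every x and both noise outcomes."""
    total = 0.0
    for x in product((0, 1), repeat=m):
        for noise, p in ((0, 0.9), (1, 0.1)):
            total += p * (h(x) != (x[0] ^ noise)) / 2**m
    return total

def empirical_error(h, S):
    """Fraction of samples in S that h misclassifies."""
    return sum(h(x) != y for x, y in S) / len(S)

rng = random.Random(0)           # fixed seed so the run is reproducible
eps, delta = 0.1, 0.05
n = samples_needed(len(classifiers), eps, delta)
S = [draw(rng) for _ in range(n)]

max_gap = max(abs(true_error(h) - empirical_error(h, S)) for h in classifiers)
best = min(classifiers, key=lambda h: empirical_error(h, S))
print(n, max_gap, true_error(best))
```

With these parameters the largest gap between test and training error over the whole class stays well below ε, and the empirical risk minimizer is the classifier that outputs x_0.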


18.4 EXERCISES 


Exercise 18.1 Suppose that we toss three independent fair coins
$a, b, c \in \{0,1\}$. What is the probability that the XOR of $a$, $b$,
and $c$ is equal to 1? What is the probability that the AND of these
three values is equal to 1? Are these two events independent?


Exercise 18.2 Give an example of random variables
$X, Y : \{0,1\}^3 \rightarrow \mathbb{R}$ such that
$\mathbb{E}[XY] \neq \mathbb{E}[X]\,\mathbb{E}[Y]$.


Exercise 18.3 Give an example of random variables
$X, Y : \{0,1\}^3 \rightarrow \mathbb{R}$ such that $X$ and $Y$ are not
independent but $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.


Exercise 18.4 Let $n$ be an odd number, and let
$X : \{0,1\}^n \rightarrow \mathbb{R}$ be the random variable defined as
follows: for every $x \in \{0,1\}^n$, $X(x) = 1$ if
$\sum_{i=0}^{n-1} x_i > n/2$ and $X(x) = 0$ otherwise. Prove that
$\mathbb{E}[X] = 1/2$.


Exercise 18.5 — standard deviation. 1. Give an example for a random
variable $X$ such that $X$’s standard deviation is equal to
$\mathbb{E}[|X - \mathbb{E}[X]|]$.

2. Give an example for a random variable $X$ such that $X$’s standard
deviation is not equal to $\mathbb{E}[|X - \mathbb{E}[X]|]$.


Exercise 18.6 — Product of expectations. Prove Lemma 18.7 


Exercise 18.7 — Transformations preserve independence. Prove Lemma 18.8


Exercise 18.8 — Variance of independent random variables. Prove that if
$X_0, \ldots, X_{n-1}$ are independent random variables then
$\mathrm{Var}[X_0 + \cdots + X_{n-1}] = \sum_{i=0}^{n-1} \mathrm{Var}[X_i]$.


Exercise 18.9 — Entropy (challenge). Recall the definition of a
distribution $\mu$ over some finite set $S$. Shannon defined the entropy
of a distribution $\mu$, denoted by $H(\mu)$, to be
$\sum_{x \in S} \mu(x) \log(1/\mu(x))$. The idea is that if $\mu$ is a
distribution of entropy $k$, then encoding members of $\mu$ will require
$k$ bits, in an amortized sense. In this exercise we justify this
definition. Let $\mu$ be such that $H(\mu) = k$.

1. Prove that for every one-to-one function
$F : S \rightarrow \{0,1\}^*$, $\mathbb{E}_{x \sim \mu} |F(x)| \geq k$.

2. Prove that for every $\epsilon$, there is some $n$ and a one-to-one
function $F : S^n \rightarrow \{0,1\}^*$, such that
$\mathbb{E}_{x \sim \mu^n} |F(x)| \leq n(k + \epsilon)$, where
$x \sim \mu^n$ denotes the experiment of choosing
$x_0, \ldots, x_{n-1}$ each independently from $S$ using the
distribution $\mu$.


Exercise 18.10 — Entropy approximation to binomial. Let
$H(p) = p\log(1/p) + (1-p)\log(1/(1-p))$.¹ Prove that for every
$p \in (0,1)$ and $\epsilon > 0$, if $n$ is large enough then²

$$2^{(H(p) - \epsilon)n} \leq \binom{n}{pn} \leq 2^{(H(p) + \epsilon)n}$$

where $\binom{n}{k}$ is the binomial coefficient $\frac{n!}{k!(n-k)!}$
which is equal to the number of $k$-size subsets of
$\{0, \ldots, n-1\}$.


Exercise 18.11 — Chernoff using Stirling. 1. Prove that
$\Pr_{x \sim \{0,1\}^n}\left[\sum_i x_i = k\right] = \binom{n}{k} 2^{-n}$.


2. Use this and Exercise 18.10 to prove (an approximate version of)
the Chernoff bound for the case that $X_0, \ldots, X_{n-1}$ are i.i.d.
random variables over $\{0,1\}$ each equaling 0 and 1 with probability
1/2. That is, prove that for every $\epsilon > 0$, and
$X_0, \ldots, X_{n-1}$ as above,
$\Pr\left[\left|\sum_{i=0}^{n-1} X_i - \frac{n}{2}\right| \geq \epsilon n\right] < 2^{-0.1 \cdot \epsilon^2 n}$.


Exercise 18.12 — Poor man’s Chernoff. Exercise 18.11 establishes the
Chernoff bound for the case that $X_0, \ldots, X_{n-1}$ are i.i.d
variables over $\{0,1\}$ with expectation 1/2. In this exercise we use a
slightly different method (bounding the moments of the random variables)
to establish a version of Chernoff where the random variables range over
$[0,1]$ and their expectation is some number $p \in [0,1]$ that may be
different than 1/2. Let $X_0, \ldots, X_{n-1}$ be i.i.d random variables
with $\mathbb{E}\, X_i = p$ and $\Pr[0 \leq X_i \leq 1] = 1$. Define
$Y_i = X_i - p$.


1. Prove that for every $j_0, \ldots, j_{n-1} \in \mathbb{N}$, if there
exists one $i$ such that $j_i$ is odd then
$\mathbb{E}\left[\prod_{i=0}^{n-1} Y_i^{j_i}\right] = 0$.


2. Prove that for every $k$,
$\mathbb{E}\left[\left(\sum_{i=0}^{n-1} Y_i\right)^k\right] \leq (10kn)^{k/2}$.³


3. Prove that for every $\epsilon > 0$,
$\Pr\left[\left|\sum_i Y_i\right| \geq \epsilon n\right] \leq 2^{-\epsilon^2 n/(10000 \log 1/\epsilon)}$.⁴


¹ While you don’t need this to solve this exercise, this
is the function that maps $p$ to the entropy (as defined
in Exercise 18.9) of the $p$-biased coin distribution over
$\{0,1\}$, which is the function $\mu : \{0,1\} \rightarrow [0,1]$ s.t.
$\mu(0) = 1 - p$ and $\mu(1) = p$.

² Hint: Use Stirling’s formula for approximating the
factorial function.


Exercise 18.13 — Sampling. Suppose that a country has 300,000,000 citi- 
zens, 52 percent of which prefer the color “green” and 48 percent of 
which prefer the color “orange”. Suppose we sample n random citi- 
zens and ask them their favorite color (assume they will answer truth- 
fully). What is the smallest value n among the following choices so 
that the probability that the majority of the sample answers “green” is 
at most 0.05? 


a. 1,000 

b. 10,000 
c. 100,000 
d. 1,000,000 


Exercise 18.14 Would the answer to Exercise 18.13 change if the country 
had 300,000,000,000 citizens? 


Exercise 18.15 — Sampling (2). Under the same assumptions as
Exercise 18.13, what is the smallest value n among the following choices
so that the probability that the majority of the sample answers “green”
is at most $2^{-100}$?


a. 1,000

b. 10,000
c. 100,000
d. 1,000,000

e. It is impossible to get such low probability since there are fewer
than $2^{100}$ citizens.


³ Hint: Bound the number of tuples $j_0, \ldots, j_{n-1}$ such
that every $j_i$ is even and $\sum j_i = k$ using the Binomial
coefficient and the fact that in any such tuple there are
at most $k/2$ distinct indices.

⁴ Hint: Set $k = 2\lceil \epsilon^2 n/1000 \rceil$ and then show that if
the event $|\sum Y_i| \geq \epsilon n$ happens then the random variable
$(\sum Y_i)^k$ is a factor of $\epsilon^{-k}$ larger than its
expectation.


18.5 BIBLIOGRAPHICAL NOTES


There are many sources for more information on discrete probability, 
including the texts referenced in Section 1.9. One particularly recom- 
mended source for probability is Harvard’s STAT 110 class, whose 
lectures are available on youtube and whose book is available online. 

The version of the Chernoff bound that we stated in Theorem 18.12 
is sometimes known as Hoeffding’s Inequality. Other variants of the 
Chernoff bound are known as well, but all of them are equally good 
for the applications of this book. 


19 
Probabilistic computation 


“in 1946 ... (I asked myself) what are the chances that a Canfield solitaire laid
out with 52 cards will come out successfully? After spending a lot of time 
trying to estimate them by pure combinatorial calculations, I wondered whether 
a more practical method ... might not be to lay it out say one hundred times and 
simply observe and count”, Stanislaw Ulam, 1983 


“The salient features of our method are that it is probabilistic ... and with a 
controllable miniscule probability of error.”, Michael Rabin, 1977 


In early computer systems, much effort was taken to drive out 
randomness and noise. Hardware components were prone to non- 
deterministic behavior from a number of causes, whether it was vac- 
uum tubes overheating or actual physical bugs causing short circuits 
(see Fig. 19.1). This motivated John von Neumann, one of the early 
computing pioneers, to write a paper on how to error correct computa- 
tion, introducing the notion of redundancy. 

So it is quite surprising that randomness turned out to be not just a
hindrance but also a resource for computation, enabling us to achieve
tasks much more efficiently than previously known. One of the first
applications involved the very same John von Neumann. While sick in
bed and playing cards, Stan Ulam came up with the observation that
calculating statistics of a system could be done much faster by running
several randomized simulations. He mentioned this idea to von Neumann,
who became very excited about it; indeed, it turned out to be crucial
for the neutron transport calculations that were needed for development
of the atom bomb and later on the hydrogen bomb. Because this project
was highly classified, Ulam, von Neumann and their collaborators came
up with the codeword “Monte Carlo” for this approach (based on the
famous casinos where Ulam’s uncle gambled). The name stuck, and
randomized algorithms are known as Monte Carlo algorithms to this day.¹

In this chapter, we will see some examples of randomized algorithms
that use randomness to compute a quantity in a faster or simpler way
than was known otherwise. We will describe the algorithms in an
informal / “pseudo-code” way, rather than as Turing machines or
NAND-TM/NAND-RAM programs. In Chapter 20 we will discuss how to
augment the computational models we saw before to incorporate the
ability to “toss coins”.


Learning Objectives:

• See examples of randomized algorithms

• Get more comfort with analyzing probabilistic processes and tail bounds

• Success amplification using tail bounds


Figure 19.1: A 1947 entry in the log book of the Harvard MARK II
computer containing an actual bug that caused a hardware malfunction.
By Courtesy of the Naval Surface Warfare Center.


¹ Some texts also talk about “Las Vegas algorithms” that always return
the right answer but whose running time is only polynomial on the
average. Since this Monte Carlo vs Las Vegas terminology is confusing,
we will not use these terms anymore, and simply talk about randomized
algorithms.


19.1 FINDING APPROXIMATELY GOOD MAXIMUM CUTS 


We start with the following example. Recall the maximum cut problem 
of finding, given a graph G = (V, E), the cut that maximizes the num- 
ber of edges. This problem is NP-hard, which means that we do not 
know of any efficient algorithm that can solve it, but randomization 
enables a simple algorithm that can cut at least half of the edges: 


Theorem 19.1 — Approximating max cut. There is an efficient probabilis- 
tic algorithm that on input an n-vertex m-edge graph G, outputs a 
cut $(S, \overline{S})$ that cuts at least m/2 of the edges of G in expectation.


Proof Idea: 

We simply choose a random cut: we choose a subset S of vertices by 
choosing every vertex v to be a member of S with probability 1/2 in- 
dependently. It’s not hard to see that each edge is cut with probability 
1/2 and so the expected number of cut edges is m/2. 

* 


Proof of Theorem 19.1. The algorithm is extremely simple: 


Algorithm Random Cut:
Input: Graph G = (V, E) with n vertices and m edges. Denote V =
$\{v_0, v_1, \ldots, v_{n-1}\}$.


Operation:


1. Pick $x$ uniformly at random in $\{0,1\}^n$.
2. Let $S \subseteq V$ be the set $\{v_i : x_i = 1, i \in [n]\}$ that includes all vertices
corresponding to coordinates of $x$ where $x_i = 1$.


3. Output the cut $(S, \overline{S})$.


We claim that the expected number of edges cut by the algorithm is
m/2. Indeed, for every edge $e \in E$, let $X_e$ be the random variable such
that $X_e(x) = 1$ if the edge $e$ is cut by $x$, and $X_e(x) = 0$ otherwise. For
every such edge $e = \{i, j\}$, $X_e(x) = 1$ if and only if $x_i \neq x_j$. Since the
pair $(x_i, x_j)$ obtains each of the values 00, 01, 10, 11 with probability
1/4, the probability that $x_i \neq x_j$ is 1/2 and hence $\mathbb{E}[X_e] = 1/2$. If we let
$X$ be the random variable corresponding to the total number of edges
cut by $S$, then $X = \sum_{e \in E} X_e$ and hence by linearity of expectation


$\mathbb{E}[X] = \sum_{e \in E} \mathbb{E}[X_e] = m(1/2) = m/2$.
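
Algorithm Random Cut can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are not in the text: the graph is given as a list of edges over vertices numbered 0 to n-1, and the names `random_cut` and `cut_size` as well as the example graph are hypothetical.

```python
import random

def random_cut(n, rng=random):
    """Return a uniformly random x in {0,1}^n; vertex i is in S iff x[i] == 1."""
    return [rng.randrange(2) for _ in range(n)]

def cut_size(edges, x):
    """Count the edges {i, j} whose endpoints fall on different sides of the cut."""
    return sum(1 for i, j in edges if x[i] != x[j])

if __name__ == "__main__":
    # Hypothetical example graph: a 5-cycle plus one chord (m = 6 edges).
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
    n, trials = 5, 20000
    rng = random.Random(0)
    avg = sum(cut_size(edges, random_cut(n, rng)) for _ in range(trials)) / trials
    print(f"average cut size over {trials} trials: {avg:.3f} (m/2 = {len(edges) / 2})")
```

The empirical average should hover around m/2, matching the linearity-of-expectation calculation; note that the graph is fixed and only the coins of the algorithm are random.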


Randomized algorithms work in the worst case. It is tempting to think of
a randomized algorithm such as the one of Theorem 19.1 as an algorithm
that works for a “random input graph” but it is actually much
better than that. The expectation in this theorem is not taken over the
choice of the graph, but rather only over the random choices of the
algorithm. In particular, for every graph G, the algorithm is guaranteed to
cut half of the edges of the input graph in expectation. That is, a
randomized algorithm outputs the correct value with good probability on
every possible input.


We will define more formally what “good probability” means in 
Chapter 20 but the crucial point is that this probability is always only 
taken over the random choices of the algorithm, while the input is not 
chosen at random. 


19.1.1 Amplifying the success of randomized algorithms 

Theorem 19.1 gives us an algorithm that cuts m/2 edges in expectation. 
But, as we saw before, expectation does not immediately imply con- 
centration, and so a priori, it may be the case that when we run the 
algorithm, most of the time we don’t get a cut matching the expecta- 
tion. Luckily, we can amplify the probability of success by repeating 
the process several times and outputting the best cut we find. We 
start by arguing that the probability the algorithm above succeeds in 
cutting at least m/2 edges is not too tiny. 


Lemma 19.2 The probability that a random cut in an m edge graph cuts 
at least m/2 edges is at least 1/(2m). 


Proof Idea: 

To see the idea behind the proof, think of the case that m = 1000. In
this case one can show that we will cut at least 500 edges with
probability at least 0.001 (and so in particular larger than 1/(2m) = 1/2000).
Specifically, if we assume otherwise, then this means that with
probability more than 0.999 the algorithm cuts 499 or fewer edges. But since
we can never cut more than the total of 1000 edges, given this
assumption, the highest value of the expected number of edges cut is if we
cut exactly 499 edges with probability 0.999 and cut 1000 edges with
probability 0.001. Yet even in this case the expected number of edges
will be $0.999 \cdot 499 + 0.001 \cdot 1000 < 500$, which contradicts the fact that
we've calculated the expectation to be at least 500 in Theorem 19.1.

* 


Proof of Lemma 19.2. Let p be the probability that we cut at least m/2 
edges and suppose, towards a contradiction, that p < 1/(2m). Since 
the number of edges cut is an integer, and m/2 is a multiple of 0.5, 
by definition of p, with probability 1 — p we cut at most m/2 — 0.5 
edges. Moreover, since we can never cut more than m edges, under 
our assumption that p < 1/(2m), we can bound the expected number 
of edges cut by 


$pm + (1 - p)(m/2 - 0.5) \leq pm + m/2 - 0.5$.


But if p < 1/(2m) then pm < 0.5 and so the right-hand side is smaller 
than m/2, which contradicts the fact that (as proven in Theorem 19.1) 
the expected number of edges cut is at least m/2. 

∎


19.1.2 Success amplification 

Lemma 19.2 shows that our algorithm succeeds at least some of the
time, but we’d like to succeed almost all of the time. The approach
to do that is to simply repeat our algorithm many times, with fresh
randomness each time, and output the best cut we get in one of these
repetitions. It turns out that with extremely high probability we will
get a cut of size at least m/2. For example, if we repeat this experiment
2000m times, then, setting $k = 2m$ and using the inequality
$(1 - 1/k)^k \leq 1/e \leq 1/2$, we can show that the probability that we
will never cut at least m/2 edges is at most


$(1 - 1/(2m))^{2000m} = (1 - 1/k)^{1000k} = ((1 - 1/k)^k)^{1000} \leq 2^{-1000}$.


More generally, the same calculations can be used to show the 
following lemma: 


Lemma 19.3 There is an algorithm that on input a graph G = (V, E) and
a number k, runs in time polynomial in |V| and k and outputs a cut
$(S, \overline{S})$ such that


$\Pr[\text{number of edges cut by } (S, \overline{S}) \geq |E|/2] \geq 1 - 2^{-k}$.


Proof of Lemma 19.3. The algorithm will work as follows: 
Algorithm AMPLIFY RANDOM CUT:


Input: Graph G = (V, E) with n vertices and m edges. Denote V =
$\{v_0, v_1, \ldots, v_{n-1}\}$. Number $k > 0$.


Operation:


1. Repeat the following 200km times:

   a. Pick $x$ uniformly at random in $\{0,1\}^n$.
   b. Let $S \subseteq V$ be the set $\{v_i : x_i = 1, i \in [n]\}$ that includes all
      vertices corresponding to coordinates of $x$ where $x_i = 1$.
   c. If $(S, \overline{S})$ cuts at least m/2 edges then halt and output $(S, \overline{S})$.

2. Output “failed”


We leave completing the analysis as an exercise to the reader (see 
Exercise 19.1). 
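
The amplification loop can be sketched as follows. This is a minimal Python sketch, not a definitive implementation: it assumes the graph is an edge list over vertices 0 to n-1, the function name `amplified_random_cut` is hypothetical, and `None` stands in for the output “failed”.

```python
import random

def amplified_random_cut(n, edges, k, rng=random):
    """Try up to 200*k*m random cuts; return the first cutting >= m/2 edges, else None."""
    m = len(edges)
    for _ in range(200 * k * m):
        # Fresh randomness in every repetition.
        x = [rng.randrange(2) for _ in range(n)]
        cut = sum(1 for i, j in edges if x[i] != x[j])
        if 2 * cut >= m:           # integer form of the test "cut >= m/2"
            return x
    return None                    # corresponds to the output "failed"

if __name__ == "__main__":
    # Hypothetical example: a 5-cycle, so m = 5 and success means cutting >= 3 edges.
    cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
    print(amplified_random_cut(5, cycle, k=1, rng=random.Random(0)))
```

Because the error is one-sided, the loop can stop at the first good cut: a returned cut can be checked directly by counting its edges.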


19.1.3 Two-sided amplification 
The analysis above relied on the fact that the maximum cut algorithm has
one-sided error. By this we mean that if we get a cut of size at least m/2
then we know we have succeeded. This is common for randomized
algorithms, but is not the only case. In particular, consider the task of
computing some Boolean function $F : \{0,1\}^* \rightarrow \{0,1\}$. A randomized
algorithm A for computing F, given input $x$, might toss coins and
succeed in outputting $F(x)$ with probability, say, 0.9. We say that A has
two-sided errors if there is positive probability that $A(x)$ outputs 1 when
$F(x) = 0$, and positive probability that $A(x)$ outputs 0 when $F(x) = 1$.
In such a case, to amplify A’s success, we cannot simply repeat it k
times and output 1 if a single one of those repetitions resulted in 1,
nor can we output 0 if a single one of the repetitions resulted in 0. But
we can output the majority value of these repetitions. By the Chernoff
bound (Theorem 18.12), with probability exponentially close to 1 (i.e.,
$1 - 2^{-\Omega(k)}$), the fraction of the repetitions where A will output $F(x)$
will be at least, say 0.89, and in such cases we will of course output the
correct answer.

The above translates into the following theorem:


Theorem 19.4 — Two-sided amplification. If $F : \{0,1\}^* \rightarrow \{0,1\}$ is a
function such that there is a polynomial-time algorithm A satisfying


$\Pr[A(x) = F(x)] \geq 0.51$


for every $x \in \{0,1\}^*$, then there is a polynomial-time algorithm B
satisfying


$\Pr[B(x) = F(x)] \geq 1 - 2^{-|x|}$


for every $x \in \{0,1\}^*$.


We omit the proof of Theorem 19.4, since we will prove a more 
general result later on in Theorem 20.5. 
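
The majority-vote idea behind two-sided amplification can be sketched in Python. This is an illustration only: `amplify_by_majority` and the noisy test algorithm `noisy_parity` (correct with probability 0.6) are hypothetical names chosen for this sketch, and the number of repetitions is passed in rather than derived from the Chernoff bound.

```python
import random

def amplify_by_majority(A, x, k):
    """Run A(x) independently k times and return the majority answer (0 or 1)."""
    ones = sum(A(x) for _ in range(k))
    return 1 if 2 * ones > k else 0

# Hypothetical noisy algorithm for F(x) = parity of the bits of x,
# which answers correctly with probability 0.6 on every input.
_rng = random.Random(0)

def noisy_parity(x):
    correct = sum(x) % 2
    return correct if _rng.random() < 0.6 else 1 - correct

if __name__ == "__main__":
    x = [1, 0, 1, 1]  # the parity of x is 1
    print(amplify_by_majority(noisy_parity, x, 2001))
```

Using an odd number of repetitions avoids ties; the failure probability of the majority vote shrinks exponentially in k, as the Chernoff bound predicts.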


19.1.4 What does this mean? 

We have shown a probabilistic algorithm that on any m-edge graph
G, will output a cut of at least m/2 edges with probability at least
$1 - 2^{-1000}$. Does it mean that we can consider this problem as “easy”?
Should we be somewhat wary of using a probabilistic algorithm, since
it can sometimes fail?

First of all, it is important to emphasize that this is still a worst case
guarantee. That is, we are not assuming anything about the input
graph: the probability is only due to the internal randomness of the
algorithm. While a probabilistic algorithm might not seem as nice as a
deterministic algorithm that is guaranteed to give an output, to get a
sense of what a failure probability of $2^{-1000}$ means, note that:


• The chance of winning the Massachusetts Mega Millions lottery is
one over $(75)^5 \cdot 15$, which is roughly $2^{-35}$. So $2^{-1000}$ corresponds
to winning the lottery about 30 times in a row, at which point you
might not care so much about your algorithm failing.


• The chance for a U.S. resident to be struck by lightning is about
1/700000 per year, which corresponds to about a $2^{-45}$ chance that you'll
be struck by lightning the very second that you're reading this
sentence (after which again you might not care so much about the
algorithm’s performance).


• Since the earth is about 5 billion years old, we can estimate the
chance that an asteroid of the magnitude that caused the dinosaurs’
extinction will hit us this very second to be about $2^{-60}$. It is quite
likely that even a deterministic algorithm will fail if this happens.


So, in practical terms, a probabilistic algorithm is just as good as
a deterministic one. But it is still a theoretically fascinating
question whether randomized algorithms actually yield more power, or
whether it is the case that for any computational problem that can be
solved by a probabilistic algorithm, there is a deterministic algorithm
with nearly the same performance.² For example, we will see in
Exercise 19.2 that there is in fact a deterministic algorithm that can cut
at least m/2 edges in an m-edge graph. We will discuss this question
in generality in Chapter 20. For now, let us see a couple of examples
where randomization leads to algorithms that are better in some sense
than the known deterministic algorithms.


² This question does have some significance to practice, since
hardware that generates high quality randomness at speed is
non-trivial to construct.


19.2 SOLVING SAT THROUGH RANDOMIZATION 


The 3SAT problem is NP-hard, and so it is unlikely that it has a
polynomial (or even subexponential) time algorithm. But this does not
mean that we can’t do at least somewhat better than the trivial $2^n$
algorithm for n-variable 3SAT. The best known worst-case algorithms
for 3SAT are randomized, and are related to the following simple
algorithm, variants of which are also used in practice:


Algorithm WalkSAT:
Input: An n variable 3CNF formula $\varphi$.
Parameters: $T, S \in \mathbb{N}$


Operation:


1. Repeat the following T steps:
   a. Choose a random assignment $x \in \{0,1\}^n$ and repeat the following
      for S steps:
      i. If $x$ satisfies $\varphi$ then output $x$.
      ii. Otherwise, choose a random clause $(\ell_i \vee \ell_j \vee \ell_k)$ that $x$ does
          not satisfy, choose a random literal in $\ell_i, \ell_j, \ell_k$ and modify $x$ to
          satisfy this literal.


2. If all the $T \cdot S$ repetitions above did not result in a satisfying
assignment then output Unsatisfiable


The running time of this algorithm is $S \cdot T \cdot poly(n)$, and so the
key question is how small we can make S and T so that the
probability that WalkSAT outputs Unsatisfiable on a satisfiable formula
$\varphi$ is small. It is known that we can do so with
$ST = O((4/3)^n) = O(1.333\ldots^n)$ (see Exercise 19.4 for a weaker result),
but we'll show below a simpler analysis yielding
$ST = O(\sqrt{3}^n) = O(1.74^n)$, which is still much better than the
trivial $2^n$ bound.³


³ The best known randomized algorithms for 3SAT run in time roughly
$O(1.308^n)$, and the best known deterministic algorithms run in time
$O(1.3303^n)$ in the worst case.


Theorem 19.5 — WalkSAT simple analysis. If we set $T = 100 \cdot \sqrt{3}^n$ and
$S = n/2$, then the probability we output Unsatisfiable for a
satisfiable $\varphi$ is at most 1/2.


Proof. Suppose that $\varphi$ is a satisfiable formula and let $x^*$ be a satisfying
assignment for it. For every $x \in \{0,1\}^n$, denote by $\Delta(x, x^*)$ the
number of coordinates that differ between $x$ and $x^*$. The heart of the proof
is the following claim:

Claim I: For every $x, x^*$ as above, in every local improvement step,
the value $\Delta(x, x^*)$ is decreased by one with probability at least 1/3.

Proof of Claim I: Since $x^*$ is a satisfying assignment, if $C$ is a clause
that $x$ does not satisfy, then at least one of the variables involved in $C$
must get different values in $x$ and $x^*$. Thus when we change $x$ by one
of the three literals in the clause, we have probability at least 1/3 of
decreasing the distance.

The second claim is that our starting point is not that bad:

Claim II: With probability at least 1/2 over a random $x \in \{0,1\}^n$,
$\Delta(x, x^*) \leq n/2$.

Proof of Claim II: Consider the map $FLIP : \{0,1\}^n \rightarrow \{0,1\}^n$ that
simply “flips” all the bits of its input from 0 to 1 and vice versa. That
is, $FLIP(x_0, \ldots, x_{n-1}) = (1 - x_0, \ldots, 1 - x_{n-1})$. Clearly FLIP is one to
one. Moreover, if $x$ is of distance $k$ to $x^*$, then $FLIP(x)$ is of distance $n - k$
to $x^*$. Now let $B$ be the “bad event” in which $x$ is of distance $> n/2$
from $x^*$. Then the set $A = FLIP(B) = \{FLIP(x) : x \in B\}$ satisfies
$|A| = |B|$ and that if $x \in A$ then $x$ is of distance $< n/2$ from $x^*$. Since
$A$ and $B$ are disjoint events, $\Pr[A] + \Pr[B] \leq 1$. Since they have the
same cardinality, they have the same probability and so we get that
$2\Pr[B] \leq 1$ or $\Pr[B] \leq 1/2$. (See also Fig. 19.2.)

Claims I and II imply that each of the T iterations of the outer loop
succeeds with probability at least $\frac{1}{2} \cdot \sqrt{3}^{-n}$. Indeed, by Claim II,
the original guess $x$ will satisfy $\Delta(x, x^*) \leq n/2$ with probability
$\Pr[\Delta(x, x^*) \leq n/2] \geq 1/2$. By Claim I, even conditioned on all the
history so far, for each of the $S = n/2$ steps of the inner loop we have
probability at least 1/3 of being “lucky” and decreasing the distance
(i.e., the output of $\Delta$) by one. The chance we will be lucky in all $n/2$
steps is hence at least $(1/3)^{n/2} = \sqrt{3}^{-n}$.

Since any single iteration of the outer loop succeeds with probability
at least $\frac{1}{2} \cdot \sqrt{3}^{-n}$, the probability that we never do so in
$T = 100 \cdot \sqrt{3}^n$ repetitions is at most
$(1 - \frac{1}{2\sqrt{3}^n})^{100 \cdot \sqrt{3}^n} \leq (1/e)^{50}$.

∎
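
The WalkSAT pseudocode above translates almost line by line into Python. The sketch below is illustrative and makes assumptions that are not in the text: clauses are tuples of non-zero integers where literal +v (resp. -v) means that variable number v-1 is 1 (resp. 0), the helper name `satisfies` is hypothetical, and `None` plays the role of the output Unsatisfiable. It also performs one extra satisfiability check after the last flip of each inner loop, a small deviation from the pseudocode.

```python
import random

def satisfies(lit, x):
    """Literal +v (resp. -v) is satisfied when variable v-1 equals 1 (resp. 0)."""
    return x[abs(lit) - 1] == (1 if lit > 0 else 0)

def walksat(clauses, n, T, S, rng=random):
    """Return a satisfying assignment (list of n bits), or None for 'Unsatisfiable'."""
    def unsatisfied(x):
        return [c for c in clauses if not any(satisfies(l, x) for l in c)]

    for _ in range(T):
        x = [rng.randrange(2) for _ in range(n)]   # fresh random starting point
        for _ in range(S):
            bad = unsatisfied(x)
            if not bad:
                return x
            lit = rng.choice(rng.choice(bad))      # random violated clause, random literal
            x[abs(lit) - 1] = 1 if lit > 0 else 0  # modify x to satisfy that literal
        if not unsatisfied(x):                     # extra check after the final flip
            return x
    return None

if __name__ == "__main__":
    # Hypothetical 3-variable formula satisfied by x = (1, 1, 1).
    phi = [(1, 2, 3), (-1, 2, 3), (1, -2, 3), (1, 2, -3), (-1, -2, 3)]
    print(walksat(phi, n=3, T=100, S=2, rng=random.Random(0)))
```

Like the max cut algorithm, this procedure has one-sided error: a returned assignment can always be verified, while `None` may be wrong with small probability on a satisfiable formula.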


19.3 BIPARTITE MATCHING 


The matching problem is one of the canonical optimization problems,
arising in all kinds of applications: matching residents with hospitals,
kidney donors with patients, flights with crews, and many others.
One prototypical variant is bipartite perfect matching. In this problem,
we are given a bipartite graph $G = (L \cup R, E)$ which has $2n$ vertices
partitioned into n-sized sets L and R, where all edges have one
endpoint in L and the other in R. The goal is to determine whether there
is a perfect matching, a subset $M \subseteq E$ of n disjoint edges. That is, M
matches every vertex in L to a unique vertex in R.

The bipartite matching problem turns out to have a polynomial-time
algorithm, since we can reduce finding a matching in G to finding a
maximum flow (or equivalently, minimum cut) in a related graph $G'$
(see Fig. 19.3). However, we will see a different probabilistic
algorithm to determine whether a graph contains such a matching.


Figure 19.2: For every $x^* \in \{0,1\}^n$, we can sort all strings in
$\{0,1\}^n$ according to their distance from $x^*$ (top to bottom in the
above figure), where we let $A = \{x \in \{0,1\}^n \mid dist(x, x^*) \leq n/2\}$ be
the “top half” of strings. If we define $FLIP : \{0,1\}^n \rightarrow \{0,1\}^n$ to be
the map that “flips” the bits of a given string $x$ then it maps every
$x \in \overline{A}$ to an output $FLIP(x) \in A$ in a one-to-one way, and so it
demonstrates that $|\overline{A}| \leq |A|$ which implies that $\Pr[A] \geq \Pr[\overline{A}]$ and
hence $\Pr[A] \geq 1/2$.


Figure 19.3: The bipartite matching problem in the graph
$G = (L \cup R, E)$ can be reduced to the minimum s, t cut problem in the
graph $G'$ obtained by adding vertices s, t to G, connecting s with L
and connecting t with R.


Let us label G’s vertices as $L = \{\ell_0, \ldots, \ell_{n-1}\}$ and $R = \{r_0, \ldots, r_{n-1}\}$.
A matching M corresponds to a permutation $\pi \in S_n$ (i.e., one-to-one
and onto function $\pi : [n] \rightarrow [n]$) where for every $i \in [n]$, we define $\pi(i)$
to be the unique $j$ such that M contains the edge $\{\ell_i, r_j\}$. Define an
$n \times n$ matrix $A = A(G)$ where $A_{i,j} = 1$ if and only if the edge $\{\ell_i, r_j\}$
is present and $A_{i,j} = 0$ otherwise. The correspondence between
matchings and permutations implies the following claim:


Lemma 19.6 — Matching polynomial. Define P = P(G) to be the polynomial
mapping $\mathbb{R}^{n^2}$ to $\mathbb{R}$ where


$P(x_{0,0}, \ldots, x_{n-1,n-1}) = \sum_{\pi \in S_n} \mathrm{sign}(\pi) \left( \prod_{i=0}^{n-1} A_{i,\pi(i)} \right) \prod_{i=0}^{n-1} x_{i,\pi(i)}$   (19.1)


Then G has a perfect matching if and only if P is not identically zero.
That is, G has a perfect matching if and only if there exists some
assignment $x = (x_{i,j})_{i,j \in [n]} \in \mathbb{R}^{n^2}$ such that $P(x) \neq 0$.⁴

Proof. If G has a perfect matching $M^*$, then let $\pi^*$ be the permutation
corresponding to $M^*$ and let $x^* \in \mathbb{R}^{n^2}$ be defined as follows: $x^*_{i,j} = 1$ if
$j = \pi^*(i)$ and $x^*_{i,j} = 0$ otherwise. We claim that $P(x^*) = \mathrm{sign}(\pi^*)$,
which in particular means that P is not identically zero. To see why
this is true, write $P(x^*) = \sum_{\pi \in S_n} \mathrm{sign}(\pi) P_\pi(x^*)$
where $P_\pi(x^*) = \prod_{i=0}^{n-1} A_{i,\pi(i)} x^*_{i,\pi(i)}$. But for all $\pi \neq \pi^*$ there will
be some $i$ such that $\pi(i) \neq \pi^*(i)$ and so $x^*_{i,\pi(i)} = 0$, which means
that $P_\pi(x^*) = 0$. On the other hand, since $\pi^*$ is a matching in G,
$A_{i,\pi^*(i)} = 1$ for all $i$, and hence
$P_{\pi^*}(x^*) = \prod_{i=0}^{n-1} A_{i,\pi^*(i)} x^*_{i,\pi^*(i)} = 1$ and
so $P(x^*) = \mathrm{sign}(\pi^*)$.

Conversely, suppose that P is not identically zero. By (19.1),
this means there is some $x \in \mathbb{R}^{n^2}$ and some permutation $\pi$ such
that $\prod_{i=0}^{n-1} A_{i,\pi(i)} x_{i,\pi(i)} \neq 0$. But for this to happen, it must be that
$A_{i,\pi(i)} \neq 0$ for all $i$, which means that for every $i$, the edge $\{\ell_i, r_{\pi(i)}\}$
exists in the graph, and hence $\pi$ must be a perfect matching in G.


As we've seen before, for every $x \in \mathbb{R}^{n^2}$, we can compute $P(x)$
by simply computing the determinant of the matrix $A(x)$, which is
obtained by replacing $A_{i,j}$ with $A_{i,j} x_{i,j}$. This reduces testing perfect
matching to the zero testing problem for polynomials: given some
polynomial $P(\cdot)$, test whether P is identically zero or not. The intuition
behind our randomized algorithm for zero testing is the following:


If a polynomial is not identically zero, then it can’t have “too many” roots. 
This intuition sort of makes sense. For one variable polynomials,
we know that a non-zero linear function has at most one root, a
quadratic function (e.g., a parabola) has at most two roots, and
generally a degree $d$ equation has at most $d$ roots. While in more than one
variable there can be an infinite number of roots (e.g., the polynomial
$x + y$ vanishes on the line $y = -x$) it is still the case that the set of
roots is very “small” compared to the set of all inputs. For example,
the roots of a bivariate polynomial form a curve, the roots of a
three-variable polynomial form a surface, and more generally the roots of an
n-variable polynomial are a space of dimension $n - 1$.


⁴ The sign of a permutation $\pi : [n] \rightarrow [n]$, denoted by
$\mathrm{sign}(\pi)$, can be defined in several equivalent ways,
one of which is that it equals $-1$ if the number of
pairs $x < y$ s.t. $\pi(x) > \pi(y)$ is odd and equals $+1$
otherwise. The importance of the term $\mathrm{sign}(\pi)$ is
that it makes P equal to the determinant of the matrix
$(A_{i,j} x_{i,j})$ and hence efficiently computable.

This intuition leads to the following simple randomized algorithm: 


To decide if P is identically zero, choose a “random” input $x$ and check if
$P(x) \neq 0$.


This makes sense: if there are only “few” roots, then we expect that 
with high probability the random input $x$ is not going to be one of
those roots. However, to transform this into an actual algorithm, we 
need to make both the intuition and the notion of a “random” input 
precise. Choosing a random real number is quite problematic, espe- 
cially when you have only a finite number of coins at your disposal, 
and so we start by reducing the task to a finite setting. We will use the 
following result: 


Theorem 19.7 — Schwartz-Zippel lemma. For every integer $q$ and
polynomial $P : \mathbb{R}^n \rightarrow \mathbb{R}$ with integer coefficients, if P has degree at
most $d$ and is not identically zero, then it has at most $d q^{n-1}$ roots in
the set $[q]^n = \{(x_0, \ldots, x_{n-1}) : x_i \in \{0, \ldots, q-1\}\}$.
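
To get a feel for the statement, one can count roots by brute force on a tiny instance. The snippet below is only a sanity check under assumptions: the function name `count_roots` and the example polynomial $(x - y)(x + y - 3)$ are hypothetical choices made for this sketch, and the bound being checked is $d q^{n-1}$ for a polynomial of degree $d$ in $n$ variables.

```python
from itertools import product

def count_roots(P, n, q):
    """Count the points x in {0,...,q-1}^n with P(*x) == 0 by brute force."""
    return sum(1 for x in product(range(q), repeat=n) if P(*x) == 0)

# Hypothetical example: P(x, y) = (x - y)(x + y - 3) has degree d = 2 in n = 2 variables.
P = lambda x, y: (x - y) * (x + y - 3)
d, n, q = 2, 2, 5

roots = count_roots(P, n, q)
assert roots <= d * q ** (n - 1)   # the Schwartz-Zippel bound d * q^(n-1)
print(roots, "roots out of", q ** n, "points; bound is", d * q ** (n - 1))
```

In this example the fraction of roots is at most d/q, which is why a random point of the grid is unlikely to be a root once q is noticeably larger than d.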


We omit the (not too complicated) proof of Theorem 19.7. We 
remark that it holds not just over the real numbers but over any field 
as well. Since the matching polynomial P of Lemma 19.6 has degree at 
most n, Theorem 19.7 leads directly to a simple algorithm for testing if 
it is non-zero: 


Algorithm Perfect-Matching:
Input: Bipartite graph G on 2n vertices $\{\ell_0, \ldots, \ell_{n-1}, r_0, \ldots, r_{n-1}\}$.


Operation:


1. For every $i, j \in [n]$, choose $x_{i,j}$ independently at random from
$[2n] = \{0, \ldots, 2n - 1\}$.


2. Compute the determinant of the matrix $A(x)$ whose $(i,j)$-th entry
equals $x_{i,j}$ if the edge $\{\ell_i, r_j\}$ is present and 0 otherwise.


3. Output no perfect matching if this determinant is zero, and
output perfect matching otherwise.
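
A possible Python rendering of Algorithm Perfect-Matching is sketched below. It is a sketch under assumptions, not the book's implementation: the graph is given as its 0/1 adjacency matrix A, the determinant is computed exactly over the integers with fraction-free (Bareiss) elimination, which is one of several reasonable choices, and the `repeats` parameter (a hypothetical addition) reruns the test to reduce the chance of missing an existing matching.

```python
import random

def determinant(M):
    """Exact integer determinant via fraction-free (Bareiss) elimination."""
    M = [row[:] for row in M]
    n, sign, prev = len(M), 1, 1
    for k in range(n - 1):
        if M[k][k] == 0:
            # Look for a row below with a non-zero entry in column k and swap it in.
            swap = next((r for r in range(k + 1, n) if M[r][k] != 0), None)
            if swap is None:
                return 0
            M[k], M[swap] = M[swap], M[k]
            sign = -sign
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                # The division by the previous pivot is always exact in Bareiss' method.
                M[i][j] = (M[i][j] * M[k][k] - M[i][k] * M[k][j]) // prev
        prev = M[k][k]
    return sign * M[n - 1][n - 1]

def has_perfect_matching(A, repeats=20, rng=random):
    """Randomized test: True means a perfect matching certainly exists."""
    n = len(A)
    for _ in range(repeats):
        # x_{i,j} is drawn from [2n] on edges and set to 0 on non-edges.
        X = [[rng.randrange(2 * n) if A[i][j] else 0 for j in range(n)]
             for i in range(n)]
        if determinant(X) != 0:
            return True
    return False

if __name__ == "__main__":
    print(has_perfect_matching([[1, 1], [1, 0]], rng=random.Random(0)))
    print(has_perfect_matching([[1, 0], [1, 0]], rng=random.Random(0)))
```

The error is one-sided: a non-zero determinant proves that the matching polynomial is not identically zero, while a zero determinant can occur by bad luck with probability at most 1/2 per repetition by Theorem 19.7.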


This algorithm can be improved further (e.g., see Exercise 19.5). 
While it is not necessarily faster than the cut-based algorithms for per- 
fect matching, it does have some advantages. In particular, it is more 
amenable for parallelization. (However, it also has the significant dis- 
advantage that it does not produce a matching but only states that one 
exists.) The Schwartz—Zippel Lemma, and the associated zero testing 
algorithm for polynomials, is widely used across computer science, 
including in several settings where we have no known deterministic 
algorithm matching their performance. 


Chapter Recap


• Using concentration results, we can amplify in
polynomial time the success probability of a probabilistic
algorithm from a mere $1/p(n)$ to $1 - 2^{-q(n)}$
for every polynomials $p$ and $q$.


• There are several randomized algorithms that are
better in various senses (e.g., simpler, faster, or
other advantages) than the best known deterministic
algorithm for the same problem.


19.4 EXERCISES 


Exercise 19.1 — Amplification for max cut. Prove Lemma 19.3 


Exercise 19.2 — Deterministic max cut algorithm.⁵


Exercise 19.3 — Simulating distributions using coins. Our model for
probability involves tossing n coins, but sometimes algorithms require
sampling from other distributions, such as selecting a uniform number in
$\{0, \ldots, M-1\}$ for some M. Fortunately, we can simulate this with an
exponentially small probability of error: prove that for every M, if
$n > k\lceil \log M \rceil$, then there is a function
$F : \{0,1\}^n \rightarrow \{0, \ldots, M-1\} \cup \{\bot\}$
such that (1) the probability that $F(x) = \bot$ is at most $2^{-k}$ and (2) the
distribution of $F(x)$ conditioned on $F(x) \neq \bot$ is equal to the uniform
distribution over $\{0, \ldots, M-1\}$.⁶


Exercise 19.4 — Better walksat analysis.

1. Prove that for every $\epsilon > 0$, if n is large enough then for every
$x^* \in \{0,1\}^n$,
$\Pr_{x \sim \{0,1\}^n}[\Delta(x, x^*) \leq n/3] \leq 2^{-(1 - H(1/3) - \epsilon)n}$,
where $H(p) = p \log(1/p) + (1-p) \log(1/(1-p))$
is the same function as in Exercise 18.10.

2. Prove that $2^{1 - H(1/4) + \frac{1}{4} \log 3} = (3/2)$.

3. Use the above to prove that for every $\delta > 0$ and large enough n, if
we set $T = 1000 \cdot (3/2 + \delta)^n$ and $S = n/4$ in the WalkSAT algorithm
then for every satisfiable 3CNF $\varphi$, the probability that we output
unsatisfiable is at most 1/2.


Exercise 19.5 — Faster bipartite matching (challenge). (to be completed:
improve the matching algorithm by working modulo a prime)


⁵ TODO: add exercise to give a deterministic max cut
algorithm that gives m/2 edges. Talk about greedy
approach.


⁶ Hint: Think of $x \in \{0,1\}^n$ as choosing k numbers
$y_1, \ldots, y_k \in \{0, \ldots, 2^{\lceil \log M \rceil} - 1\}$. Output the first such
number that is in $\{0, \ldots, M-1\}$.


19.5 BIBLIOGRAPHICAL NOTES 


The books of Motwani and Raghavan [MR95] and Mitzenmacher and
Upfal [MU17] are two excellent resources for randomized algorithms.
Some of the history of the discovery of Monte Carlo algorithms is
covered here.
20
Modeling randomized computation


Learning Objectives:


• Formal definition of probabilistic polynomial
time: the class BPP.


• Proof that every function in BPP can be
computed by poly(n)-sized NAND-CIRC
programs/circuits.


• Relations between BPP and NP.


• Pseudorandom generators


“Any one who considers arithmetical methods of producing random digits is, of 
course, in a state of sin.” John von Neumann, 1951. 


So far we have described randomized algorithms in an informal 
way, assuming that an operation such as “pick a string $x \in \{0,1\}^n$”
can be done efficiently. We have neglected to address two questions: 


1. How do we actually efficiently obtain random strings in the physi- 
cal world? 


2. What is the mathematical model for randomized computations, 
and is it more powerful than deterministic computation? 


The first question is of both practical and theoretical importance, 
but for now let’s just say that there are various physical sources of 
“random” or “unpredictable” data. A user’s mouse movements and 
typing pattern, (non-solid state) hard drive and network latency, 
thermal noise, and radioactive decay have all been used as sources for 
randomness (see discussion in Section 20.8). For example, many Intel 
chips come with a random number generator built in. One can even 
build mechanical coin tossing machines (see Fig. 20.1). 


Figure 20.1: A mechanical coin tosser built for Percy 
Diaconis by Harvard technicians Steve Sansone and 
Rick Haggerty 


Compiled on 12.19.2022 22:58 




20.1 MODELING RANDOMIZED COMPUTATION 


Modeling randomized computation is actually quite easy. We can
add the following operation to any programming language such as
NAND-TM, NAND-RAM, NAND-CIRC, etc.:


foo = RAND() 


where foo is a variable. The result of applying this operation is that 
foo is assigned a random bit in {0,1}. (Every time the RAND operation 
is invoked it returns a fresh independent random bit.) We call the 
programming languages that are augmented with this extra operation 
RNAND-TM, RNAND-RAM, and RNAND-CIRC respectively. 

Similarly, we can easily define randomized Turing machines as
Turing machines in which the transition function $\delta$ gets as an extra
input (in addition to the current state and symbol read from the tape)
a bit $b$ that in each step is chosen at random from $\{0,1\}$. Of course the
transition function can ignore this bit (and have the same output
regardless of whether $b = 0$ or $b = 1$), and hence randomized Turing
machines generalize deterministic Turing machines.
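
For intuition, the RAND() operation can be mimicked in an ordinary programming language. The Python sketch below is an analogy only and is not part of the formal RNAND-TM model: it assumes that the standard library call `secrets.randbits` is an acceptable stand-in for a source of fresh independent bits, and the helper name `random_string` is hypothetical.

```python
import secrets

def RAND():
    """Return one fresh random bit, playing the role of the RAND() operation."""
    return secrets.randbits(1)

def random_string(n):
    """Pick x in {0,1}^n using n independent calls to RAND()."""
    return [RAND() for _ in range(n)]

if __name__ == "__main__":
    foo = RAND()          # the analogue of the line "foo = RAND()"
    print(foo, random_string(8))
```

An operation such as “pick a string in $\{0,1\}^n$” used informally in Chapter 19 is then just n invocations of the single-bit primitive.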




We can use the RAND() operation to define the notion of a function 
being computed by a randomized T(n) time algorithm for every nice 
time bound T : N —> N, as well as the notion of a finite function being 
computed by a size S randomized NAND-CIRC program (or, equiv- 
alently, a randomized circuit with S gates that correspond to either 
the NAND or coin-tossing operations). However, for simplicity we 
will not define randomized computation in full generality, but simply 
focus on the class of functions that are computable by randomized 
algorithms running in polynomial time, which by historical convention 
is known as BPP: 


Definition 20.1 — The class BPP. Let $F : \{0,1\}^* \rightarrow \{0,1\}$. We say that
$F \in BPP$ if there exist constants $a, b \in \mathbb{N}$ and an RNAND-TM
program P such that for every $x \in \{0,1\}^*$, on input $x$, the program
P halts within at most $a|x|^b$ steps and


$\Pr[P(x) = F(x)] \geq \frac{2}{3}$   (20.1)


where this probability is taken over the result of the RAND
operations of P.


Note that the probability in (20.1) is taken only over the random
choices in the execution of P and not over the choice of the input x.
In particular, as discussed in Big Idea 24, BPP is still a worst
case complexity class, in the sense that if F is in BPP then there is a
polynomial-time randomized algorithm that computes F with probability
at least 2/3 on every possible (and not just random) input.

The same polynomial-overhead simulation of NAND-RAM pro- 
grams by NAND-TM programs we saw in Theorem 13.5 extends to 
randomized programs as well. Hence the class BPP is the same re- 
gardless of whether it is defined via RNAND-TM or RNAND-RAM 
programs. Similarly, we could have just as well defined BPP using 
randomized Turing machines. 

Because of these equivalences, below we will use the name “poly- 
nomial time randomized algorithm” to denote a computation that can be 
modeled by a polynomial-time RNAND-TM program, RNAND-RAM 
program, or a randomized Turing machine (or any programming lan- 
guage that includes a coin tossing operation). Since all these models 
are equivalent up to polynomial factors, you can use your favorite 
model to capture polynomial-time randomized algorithms without 
any loss in generality. 


Solved Exercise 20.1 — Choosing from a set. Modern programming
languages often involve not just the ability to toss a random coin in {0,1}
but also to choose an element at random from a set S. Show that you
can emulate this primitive using coin tossing. Specifically, show that
there is a randomized algorithm A that, on input a set S of m strings
of length n, runs in time poly(n, m) and outputs either an element
x ∈ S or “fail” such that

1. The probability p that A outputs “fail” satisfies p < 2^{−n} (a
number small enough that it can be ignored).

2. For every x ∈ S, the probability that A outputs x is exactly (1 − p)/m
(and so the output is uniform over S if we ignore the tiny probability
of failure).


Solution:

If the size of S is a power of two, that is m = 2^ℓ for some ℓ ∈ N,
then we can choose a random element in S by tossing ℓ coins to
obtain a string w ∈ {0,1}^ℓ and then output the i-th element of S
where i is the number whose binary representation is w.

If the size of S is not a power of two, then our first attempt will be to
let ℓ = ⌈log m⌉ and do the same, but then output the i-th element of
S if i ∈ [m] and output “fail” otherwise. Conditioned on not
outputting “fail”, this element is distributed uniformly in S. However,
in the worst case, 2^ℓ can be almost 2m and so the probability of failure
might be close to half. To reduce the failure probability, we can
repeat the experiment above n times. Specifically, we will use the
following algorithm (Algorithm 20.2): in each of n iterations, toss ℓ
fresh coins to obtain a number i ∈ [2^ℓ]; if i ∈ [m], output the i-th
element of S and halt. If all n iterations end with i ∉ [m], output “fail”.
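This repeat-until-success procedure can be sketched in Python as follows. It is a minimal illustration rather than a definitive implementation: the name `choose_from_set`, the `coin` parameter (standing in for RAND()), and the use of `None` to signal “fail” are assumptions made for this sketch.

```python
import random

def choose_from_set(S, n, coin=lambda: random.getrandbits(1)):
    """Return an element of the list S, or None to signal "fail".

    S    -- list of m >= 1 elements (standing in for the set S)
    n    -- number of attempts; each attempt fails with probability < 1/2
    coin -- zero-argument function returning one bit (stands in for RAND())
    """
    m = len(S)
    ell = (m - 1).bit_length()      # ell = ceil(log2(m)), so 2**ell < 2*m
    for _ in range(n):
        i = 0
        for _ in range(ell):        # toss ell coins to get i in [2**ell]
            i = 2 * i + coin()
        if i < m:                   # conditioned on i < m, i is uniform in [m]
            return S[i]
    return None                     # all n attempts landed outside [m]
```

Passing a deterministic `coin` makes the behavior easy to inspect: with three elements and a coin that always returns 1, every attempt yields i = 3 and the function returns `None`.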


Conditioned on not failing, the output of Algorithm 20.2 is uniformly
distributed in S. Moreover, since 2^ℓ < 2m, the probability
of failure in each iteration is less than 1/2 and so the probability of
failure in all of them is at most (1/2)^n = 2^{−n}.


20.1.1 An alternative view: random coins as an “extra input” 

While we presented randomized computation as adding an extra
“coin tossing” operation to our programs, we can also model this as
being given an additional extra input. That is, we can think of a
randomized algorithm A as a deterministic algorithm A’ that takes two
inputs x and r where the second input r is chosen at random from
{0,1}^m for some m ∈ N (see Fig. 20.2). The equivalence to
Definition 20.1 is shown in the following theorem:


Theorem 20.3 — Alternative characterization of BPP. Let F : {0,1}* →
{0,1}. Then F ∈ BPP if and only if there exist a, b ∈ N and
G : {0,1}* → {0,1} such that G is in P and for every x ∈ {0,1}*,

$$\Pr_{r \sim \{0,1\}^{a|x|^b}}[G(xr) = F(x)] \geq \tfrac{2}{3} \qquad (20.2)$$

Figure 20.2: The two equivalent views of randomized algorithms. We
can think of such an algorithm as having access to an internal RAND()
operation that outputs a random independent value in {0,1} whenever
it is invoked, or we can think of it as a deterministic algorithm that
in addition to the standard input x ∈ {0,1}^n obtains an additional
auxiliary input r ∈ {0,1}^m that is chosen uniformly at random.

Proof Idea:

The idea behind the proof is that, as illustrated in Fig. 20.2, we can
simply replace sampling a random coin with reading a bit from the
extra “random input” r and vice versa. To prove this rigorously we
need to work through some slightly cumbersome formal notation.
This might be one of those proofs that is easier to work out on your
own than to read.

*


Proof of Theorem 20.3. We start by showing the “only if” direction.
Let F ∈ BPP and let P be an RNAND-TM program that computes
F as per Definition 20.1, and let a, b ∈ N be such that on every input
of length n, the program P halts within at most an^b steps. We will
construct a polynomial-time algorithm P’ such that for every x ∈
{0,1}^n, if we set m = an^b, then

$$\Pr_{r \sim \{0,1\}^m}[P'(xr) = 1] = \Pr[P(x) = 1],$$

where the probability in the right-hand side is taken over the RAND()
operations in P. In particular this means that if we define G(xr) =
P’(xr) then the function G satisfies the conditions of (20.2).

The algorithm P’ will be very simple: it simulates the program P,
maintaining a counter i initialized to 0. Every time that P makes a
RAND() operation, the program P’ will supply the result from r_i and
increment i by one. We will never “run out” of bits, since the running
time of P is at most an^b and hence it can make at most this number of
RAND() calls. The output of P’(xr) for a random r ~ {0,1}^m will be
distributed identically to the output of P(x).

For the other direction, given a function G ∈ P satisfying the
condition (20.2) and a NAND-TM program P’ that computes G in
polynomial time, we can construct an RNAND-TM program P that
computes F in polynomial time. On input x ∈ {0,1}^n, the program P
will simply use the RAND() instruction an^b times to fill an array
R[0], ..., R[an^b − 1] and then execute the original program P’ on
input xr where r_i is the i-th element of the array R. Once again, it is
clear that if P’ runs in polynomial time then so will P, and for every
input x and r ∈ {0,1}^{an^b}, the output of P on input x and where the
coin tosses outcome is r is equal to P’(xr).

■
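The “counter into the random input” construction from the proof can be sketched in Python. This is only an illustration under stated assumptions: `with_internal_coins` is a made-up toy algorithm, and `as_deterministic` is a hypothetical helper that plays the role of the transformation from P to P’, with a Python iterator acting as the counter i.

```python
import random

def with_internal_coins(x, rand=lambda: random.getrandbits(1)):
    """Toy randomized algorithm: flips the first input bit iff two coins are both 1."""
    b0 = rand()
    b1 = rand()
    return int(x[0]) ^ (b0 & b1)

def as_deterministic(A, m):
    """Turn A (which calls rand()) into A'(x, r) that reads its coins from r."""
    def A_prime(x, r):
        assert len(r) == m          # r is the extra "random input" of length m
        tape = iter(r)              # each rand() call consumes the next bit r_i
        return A(x, rand=lambda: int(next(tape)))
    return A_prime
```

Enumerating all four strings r of length 2 for this toy algorithm shows that `A_prime("1", r)` equals 1 for exactly three of them, matching the probability 3/4 that the coin-tossing version outputs 1 on input "1".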


“Random tapes”. Theorem 20.3 motivates sometimes considering the
randomness of an RNAND-TM (or RNAND-RAM) program as an
extra input. As such, if A is a randomized algorithm that on inputs of
length n makes at most m coin tosses, we will often use the notation
A(x;r) (where x ∈ {0,1}^n and r ∈ {0,1}^m) to refer to the result of
executing A on input x when the coin tosses of A correspond to the
coordinates of r. This second, or “auxiliary,” input is sometimes
referred to as a “random tape.” This terminology originates from the
model of randomized Turing machines.


20.1.2 Success amplification of two-sided error algorithms 
The number 2/3 might seem arbitrary, but as we’ve seen in Chapter 19 
it can be amplified to our liking: 


Theorem 20.5 — Amplification. Let F : {0,1}* → {0,1} be a Boolean
function such that there is a polynomial p : N → N and a
polynomial-time randomized algorithm A satisfying that for
every x ∈ {0,1}^n,

$$\Pr[A(x) = F(x)] \geq \tfrac{1}{2} + \tfrac{1}{p(n)} \qquad (20.3)$$

Then for every polynomial q : N → N there is a polynomial-time
randomized algorithm B satisfying for every x ∈ {0,1}^n,

$$\Pr[B(x) = F(x)] \geq 1 - 2^{-q(n)}.$$


Proof Idea: 


The proof is the same as we’ve seen before in the case of maximum
cut and other examples. We use the Chernoff bound to argue that if
A computes F with probability at least 1/2 + ε and we run it O(k/ε²)
times, each time using fresh and independent random coins, then the
probability that the majority of the answers will not be correct will be
less than 2^{−k}. Amplification can be thought of as a “polling” of the
choices for randomness for the algorithm (see Fig. 20.3).

* 


Proof of Theorem 20.5. Let A be an algorithm satisfying (20.3). Set
ε = 1/p(n) and k = q(n) where p, q are the polynomials in the theorem
statement. We can run A on input x for t = 10k/ε² times, using fresh
randomness in each execution, and compute the outputs y_0, ..., y_{t−1}.
We output the value y that appeared the largest number of times. Let
X_i be the random variable that is equal to 1 if y_i = F(x) and equal to
0 otherwise. The random variables X_0, ..., X_{t−1} are i.i.d. and satisfy
E[X_i] = Pr[X_i = 1] ≥ 1/2 + ε, and hence by linearity of expectation
E[Σ_{i=0}^{t−1} X_i] ≥ t(1/2 + ε). For the plurality value to be incorrect,
it must hold that Σ_{i=0}^{t−1} X_i ≤ t/2, which means that
Σ_{i=0}^{t−1} X_i is at least εt far from its expectation. Hence by the
Chernoff bound (Theorem 18.12), the probability that the plurality value
is not correct is at most 2e^{−ε²t}, which is smaller than 2^{−k} for our
choice of t.
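The majority-vote procedure in this proof can be written as a short Python sketch. The function name `amplify` and its parameters are assumptions for illustration; `A` is any callable that returns a bit and uses fresh randomness on each call.

```python
import math
from collections import Counter

def amplify(A, x, eps, k):
    """Run A on x for t = ceil(10*k / eps**2) independent trials and
    return the answer that appeared the largest number of times."""
    t = math.ceil(10 * k / eps ** 2)
    votes = Counter(A(x) for _ in range(t))
    return votes.most_common(1)[0][0]
```

The number of repetitions grows as 1/ε², so an algorithm whose advantage over 1/2 is only 1/p(n) still needs just polynomially many runs to reach error below 2^{−q(n)}.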


20.2 BPP AND NP COMPLETENESS 


Since “noisy processes” abound in nature, randomized algorithms can 
be realized physically, and so it is reasonable to propose BPP rather 
than P as our mathematical model for “feasible” or “tractable” com- 
putation. One might wonder if this makes all the previous chapters 
irrelevant, and in particular if the theory of NP completeness still 
applies to probabilistic algorithms. Fortunately, the answer is Yes: 


Theorem 20.6 — NP hardness and BPP. Suppose that F is NP-hard and
F ∈ BPP. Then NP ⊆ BPP.


Before seeing the proof, note that Theorem 20.6 implies that if there 
was a randomized polynomial time algorithm for any NP-complete 
problem such as 3SAT, ISET etc., then there would be such an algo- 
rithm for every problem in NP. Thus, regardless of whether our model 
of computation is deterministic or randomized algorithms, NP com- 
plete problems retain their status as the “hardest problems in NP.” 


Proof Idea: 

The idea is to simply run the reduction as usual, and plug it into 
the randomized algorithm instead of a deterministic one. It would 
be an excellent exercise, and a way to reinforce the definitions of NP- 
hardness and randomized algorithms, for you to work out the proof 
for yourself. However for the sake of completeness, we include this 
proof below. 

* 


Proof of Theorem 20.6. Suppose that F is NP-hard and F ∈ BPP.
We will now show that this implies that NP ⊆ BPP. Let G ∈ NP.
By the definition of NP-hardness, it follows that G ≤_p F, or that in
other words there exists a polynomial-time computable function R :
{0,1}* → {0,1}* such that G(x) = F(R(x)) for every x ∈ {0,1}*. Now
if F is in BPP then there is a polynomial-time RNAND-TM program P
such that

$$\Pr[P(y) = F(y)] \geq 2/3 \qquad (20.4)$$

for every y ∈ {0,1}* (where the probability is taken over the random
coin tosses of P). Hence we can get a polynomial-time RNAND-TM
program P’ to compute G by setting P’(x) = P(R(x)). By (20.4),
Pr[P’(x) = F(R(x))] ≥ 2/3 and since F(R(x)) = G(x) this implies that
Pr[P’(x) = G(x)] ≥ 2/3, which proves that G ∈ BPP.


Figure 20.3: If F ∈ BPP then there is a randomized
polynomial-time algorithm P with the following
property: In the case F(x) = 0 two thirds of the
“population” of random choices satisfy P(x;r) = 0
and in the case F(x) = 1 two thirds of the population
satisfy P(x;r) = 1. We can think of amplification as
a form of “polling” of the choices of randomness. By
the Chernoff bound, if we poll a sample of O(log(1/δ)/ε²)
random choices r, then with probability at least
1 − δ, the fraction of r’s in the sample satisfying
P(x;r) = 1 will give us an estimate of the fraction of
the population within an ε margin of error. This is the
same calculation used by pollsters to determine the
needed sample size in their polls.


Most of the results we’ve seen about NP hardness, including the
search to decision reduction of Theorem 16.1, the decision to
optimization reduction of Theorem 16.3, and the quantifier elimination
result of Theorem 16.6, all carry over in the same way if we replace P
with BPP as our model of efficient computation. Thus if NP ⊆ BPP then
we get essentially all of the strange and wonderful consequences of
P = NP. Unsurprisingly, we cannot rule out this possibility. In fact,
unlike P = EXP, which is ruled out by the time hierarchy theorem, we
don’t even know how to rule out the possibility that BPP = EXP! Thus
a priori it’s possible (though seems highly unlikely) that randomness
is a magical tool that allows us to speed up arbitrary exponential time
computation.¹ Nevertheless, as we discuss below, it is believed that
randomization’s power is much weaker and BPP lies in much more
“pedestrian” territory.


20.3 THE POWER OF RANDOMIZATION 


A major question is whether randomization can add power to compu- 
tation. Mathematically, we can phrase this as the following question: 
does BPP = P? Given what we've seen so far about the relations of 
other complexity classes such as P and NP, or NP and EXP, one might 
guess that: 


1. We do not know the answer to this question. 
2. But we suspect that BPP is different than P. 


One would be correct about the former, but wrong about the latter. 
As we will see, we do in fact have reasons to believe that BPP = P. 
This can be thought of as supporting the extended Church Turing hy- 
pothesis that deterministic polynomial-time Turing machines capture 
what can be feasibly computed in the physical world. 

We now survey some of the relations that are known between 
BPP and other complexity classes we have encountered. (See also 
Fig. 20.4.) 


20.3.1 Solving BPP in exponential time 
It is not hard to see that if F is in BPP then it can be computed in 
exponential time. 


Theorem 20.7 — Simulating randomized algorithms in exponential time. 


BPP ⊆ EXP
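One way to see why this containment holds, using the “extra input” view of Theorem 20.3, is to try all 2^m possible random strings and output the majority answer. The Python sketch below illustrates this under stated assumptions: `G` is a hypothetical deterministic function taking the input x and the coins r as strings, and `simulate_deterministically` is a name chosen for this example.

```python
from itertools import product

def simulate_deterministically(G, x, m):
    """Return the majority value of G(x, r) over all 2**m strings r.

    If at least 2/3 of the strings r give G(x, r) = F(x), the majority
    value equals F(x). The loop runs 2**m times, which is exponential in
    the input length when m is polynomial in it.
    """
    ones = 0
    for bits in product("01", repeat=m):
        ones += G(x, "".join(bits))
    return int(2 * ones > 2 ** m)
```

The running time is 2^m times the cost of one evaluation of G, which is why this argument yields an exponential-time rather than a polynomial-time simulation.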




1 At the time of this writing, the largest “natural” com- 
plexity class which we can’t rule out being contained 
in BPP is the class NEXP, which we did not define 

in this course, but corresponds to non-deterministic 
exponential time. See this paper for a discussion of 
this question. 


Figure 20.4: Some possibilities for the relations be- 
tween BPP and other complexity classes. Most 
researchers believe that BPP = P and that these 
classes are not powerful enough to solve NP-complete 
problems, let alone all problems in EXP. However, 
we have not even been able yet to rule out the possi- 
bility that randomness is a “silver bullet” that allows 
exponential speedup on all problems, and hence 
BPP = EXP. As we've already seen, we also can’t 
rule out that P = NP. Interestingly, in the latter case, 
P = BPP. 
20.3.2 Simulating randomized algorithms by circuits

We have seen in Theorem 13.12 that if F is in P, then there is a
polynomial p : N → N such that for every n, the restriction F↾n of F to
inputs in {0,1}^n is in SIZE(p(n)). (In other words, that P ⊆ P/poly.)
A priori it is not at all clear that the same holds for a function in BPP,
but this does turn out to be the case.


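[Figure 20.5 panels; in each table the columns are inputs x and the rows
are random choices r. Left, “Average case” BPP:
Pr_{x~{0,1}^n, r~{0,1}^m}[A(x;r) = F(x)] ≥ 2/3. Middle, BPP:
∀x ∈ {0,1}^n, Pr_{r~{0,1}^m}[A(x;r) = F(x)] ≥ 2/3. Right, “Offline” BPP:
Pr_{r~{0,1}^m}[∀x ∈ {0,1}^n, A(x;r) = F(x)] ≥ 2/3.]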


Theorem 20.8 — Randomness does not help for non-uniform computation. 


BPP ⊆ P/poly

That is, for every F ∈ BPP, there exist some a, b ∈ N such that
for every n > 0, F↾n ∈ SIZE(an^b) where F↾n is the restriction of F to
inputs in {0,1}^n.


Proof Idea: 

The idea behind the proof is that we can first amplify by repetition
the probability of success from 2/3 to 1 − 0.1·2^{−n}. This will allow us
to show that for every n ∈ N there exists a single fixed choice of
“favorable coins” which is a string r of length polynomial in n such that
if r is used for the randomness then we output the right answer on all of
the possible 2^n inputs. We can then use the standard “unravelling the
loop” technique to transform an RNAND-TM program to an RNAND-CIRC
program, and “hardwire” the favorable choice of random coins
to transform the RNAND-CIRC program into a plain old deterministic
NAND-CIRC program.

*

Figure 20.5: The possible guarantees for a randomized
algorithm A computing some function F. In
the tables above, the columns correspond to different
inputs and the rows to different choices of the
random tape. A cell at position r, x is colored green
if A(x;r) = F(x) (i.e., the algorithm outputs the
correct answer) and red otherwise. The standard BPP
guarantee corresponds to the middle figure, where
for every input x, at least two thirds of the choices
r for a random tape will result in A computing the
correct value. That is, every column is colored green
in at least two thirds of its coordinates. In the left
figure we have an “average case” guarantee where
the algorithm is only guaranteed to output the correct
answer with probability two thirds over a random
input (i.e., at most one third of the total entries of the
table are colored red, but there could be an all red
column). The right figure corresponds to the “offline
BPP” case: with probability at least two thirds over
the random choice r, r will be good for every input.
That is, at least two thirds of the rows are all green.
Theorem 20.8 (BPP ⊆ P/poly) is proven by amplifying
the success of a BPP algorithm until we have the
“offline BPP” guarantee, and then hardwiring the
choice of the randomness r to obtain a non-uniform
deterministic algorithm.


Proof of Theorem 20.8. Suppose that F ∈ BPP. Let P be a
polynomial-time RNAND-TM program that computes F as per
Definition 20.1. Using Theorem 20.5, we can amplify the success
probability of P to obtain an RNAND-TM program P’ that is at most a
factor of O(n) slower (and hence still polynomial time) such that for
every x ∈ {0,1}^n

$$\Pr_{r \sim \{0,1\}^m}[P'(x;r) = F(x)] \geq 1 - 0.1 \cdot 2^{-n}, \qquad (20.5)$$

where m is the number of coin tosses that P’ uses on inputs of
length n. We use the notation P’(x;r) to denote the execution of P’
on input x and when the result of the coin tosses corresponds to the
string r.

For every x ∈ {0,1}^n, define the “bad” event B_x to hold if P’(x) ≠
F(x), where the sample space for this event consists of the coins of P’.
Then by (20.5), Pr[B_x] ≤ 0.1·2^{−n} for every x ∈ {0,1}^n. Since there
are 2^n many such x’s, by the union bound we see that the probability of
the union of the events {B_x}_{x∈{0,1}^n} is at most 0.1. This means that
if we choose r ~ {0,1}^m, then with probability at least 0.9 it will be the
case that for every x ∈ {0,1}^n, F(x) = P’(x;r). (Indeed, otherwise the
event B_x would hold for some x.) In particular, because of the mere
fact that the probability of ∪_{x∈{0,1}^n} B_x is smaller than 1, this
means that there exists a particular r* ∈ {0,1}^m such that

$$P'(x; r^*) = F(x) \qquad (20.6)$$

for every x ∈ {0,1}^n.

Now let us use the standard “unravelling the loop” technique and
transform P’ into a NAND-CIRC program Q of polynomial in n size,
such that Q(xr) = P’(x;r) for every x ∈ {0,1}^n and r ∈ {0,1}^m. Then
by “hardwiring” the values r*_0, ..., r*_{m−1} in place of the last m
inputs of Q, we obtain a new NAND-CIRC program Q_{r*} that satisfies
by (20.6) that Q_{r*}(x) = F(x) for every x ∈ {0,1}^n. This demonstrates
that F↾n has a polynomial-sized NAND-CIRC program, hence completing
the proof of Theorem 20.8.

■
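The existence argument can be made concrete on a tiny example. The Python sketch below searches by brute force for a string r* that is correct on every input. All names here are hypothetical, and `toy_A` is a made-up algorithm whose error probability on each input is 2^{−7}, which is below the 0.1·2^{−n} bound for n = 3. The proof only asserts that r* exists; the exhaustive search is for illustration and takes exponential time.

```python
from itertools import product

def find_good_coins(A, F, n, m):
    """Return some r in {0,1}^m with A(x, r) == F(x) for every x in {0,1}^n,
    or None if no such r exists (exhaustive search, exponential time)."""
    inputs = ["".join(b) for b in product("01", repeat=n)]
    for bits in product("01", repeat=m):
        r = "".join(bits)
        if all(A(x, r) == F(x) for x in inputs):
            return r
    return None

def parity(x):
    return x.count("1") % 2

def toy_A(x, r):
    """Toy algorithm: wrong on input x only when r equals x followed by 0000."""
    wrong = (r == x + "0000")
    return 1 - parity(x) if wrong else parity(x)
```

Here `find_good_coins(toy_A, parity, 3, 7)` skips the all-zero string (which is bad for the input 000) and returns the next candidate, which is good for all eight inputs. Hardwiring that string into the circuit is what removes the randomness.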


20.4 DERANDOMIZATION 


The proof of Theorem 20.8 can be summarized as follows: we can
replace a poly(n)-time algorithm that tosses coins as it runs with an
algorithm that uses a single set of coin tosses r* ∈ {0,1}^{poly(n)} which
will be good enough for all inputs of size n. Another way to say it is
that for the purposes of computing functions, we do not need “online”
access to random coins and can generate a set of coins “offline” ahead
of time, before we see the actual input.

But this does not really help us with answering the question of
whether BPP equals P, since we still need to find a way to generate
these “offline” coins in the first place. To derandomize an RNAND-TM
program we will need to come up with a single deterministic
algorithm that will work for all input lengths. That is, unlike in the
case of RNAND-CIRC programs, we cannot choose for every input
length n some string r* ∈ {0,1}^{poly(n)} to use as our random coins.

Can we derandomize randomized algorithms, or does randomness
add an inherent extra power for computation? This is a fundamentally
interesting question but is also of practical significance. Ever since
people started to use randomized algorithms during the Manhattan
project, they have been trying to remove the need for randomness and
replace it with numbers that are selected through some deterministic
process. Throughout the years this approach has often been used
successfully, though there have been a number of failures as well.

A common approach people used over the years was to replace
the random coins of the algorithm by a “randomish looking” string
that they generated through some arithmetic process. For example,
one can use the digits of π for the random tape. Using these types of
methods corresponds to what von Neumann referred to as a “state
of sin”. (Though this is a sin that he himself frequently committed,
as generating true randomness in sufficient quantity was and still is
often too expensive.) The reason that this is considered a “sin” is that
such a procedure will not work in general. For example, it is easy to
modify any probabilistic algorithm A such as the ones we have seen in
Chapter 19, to an algorithm A’ that is guaranteed to fail if the random
tape happens to equal the digits of π. This means that the procedure
“replace the random tape by the digits of π” does not yield a general
way to transform a probabilistic algorithm to a deterministic one that
will solve the same problem. Of course, this procedure does not always
fail, but we have no good way to determine when it fails and when
it succeeds. This reasoning is not specific to π and holds for every
deterministically produced string, whether it is obtained from π, e, the
Fibonacci series, or anything else.
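The modification from A to A’ is simple enough to sketch in Python. The names below are assumptions for illustration: `sabotage` is a hypothetical wrapper, the randomized algorithm is modeled as a function A(x, r) of the input and the random tape, and returning `None` stands for failing to produce an answer.

```python
# First 16 bits of the fractional part of pi in binary (0x243F).
PI_TAPE = "0010010000111111"

def sabotage(A, forbidden_tape):
    """Wrap A(x, r) into A'(x, r) that fails whenever r equals forbidden_tape."""
    def A_prime(x, r):
        if r == forbidden_tape:
            return None             # fail on this one particular tape
        return A(x, r)
    return A_prime
```

On a uniformly random 16-bit tape, A’ behaves differently from A with probability only 2^{−16}, so it is essentially as good a randomized algorithm as A, yet the rule “use the bits of π as the tape” always fails on it.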

An algorithm that checks if its random tape is equal to π and then
fails seems to be quite silly, but this is but the “tip of the iceberg” for a
very serious issue. Time and again people have learned the hard way
that one needs to be very careful about producing random bits using
deterministic means. As we will see when we discuss cryptography,
many spectacular security failures and break-ins were the result of
using “insufficiently random” coins.

2 One amusing anecdote is a recent case where scammers
managed to predict the imperfect “pseudorandom
generator” used by slot machines to cheat
casinos. Unfortunately we don’t know the details of
how they did it, since the case was sealed.


20.4.1 Pseudorandom generators 
So, we can’t use any single string to “derandomize” a probabilistic 
algorithm. It turns out however, that we can use a collection of strings 
to do so. Another way to think about it is that rather than trying to 
eliminate the need for randomness, we start by focusing on reducing the 
amount of randomness needed. (Though we will see that if we reduce 
the randomness sufficiently, we can eventually get rid of it altogether.) 
We make the following definition: 


Definition 20.9 — Pseudorandom generator. A function G : {0,1}^ℓ →
{0,1}^m is a (T, ε)-pseudorandom generator if for every circuit C with
m inputs, one output, and at most T gates,

$$\left| \Pr_{s \sim \{0,1\}^\ell}[C(G(s)) = 1] - \Pr_{r \sim \{0,1\}^m}[C(r) = 1] \right| < \epsilon \qquad (20.7)$$
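To make condition (20.7) concrete, the following Python sketch computes the left-hand side by full enumeration for a tiny example. Everything here is a toy assumption: `parity_extend` is a deliberately bad “generator” that appends one parity bit to its seed, and `parity_test` is a small distinguisher (written as a Python function rather than a circuit) that exploits it.

```python
from itertools import product

def all_strings(k):
    """Yield every string in {0,1}^k."""
    return ("".join(b) for b in product("01", repeat=k))

def advantage(C, G, ell, m):
    """| Pr_s[C(G(s)) = 1] - Pr_r[C(r) = 1] |, computed by full enumeration."""
    p_pseudo = sum(C(G(s)) for s in all_strings(ell)) / 2 ** ell
    p_random = sum(C(r) for r in all_strings(m)) / 2 ** m
    return abs(p_pseudo - p_random)

def parity_extend(s):
    """Toy 'generator' from ell to ell+1 bits: append the parity of s.
    This is NOT pseudorandom; it only illustrates the definition."""
    return s + str(s.count("1") % 2)

def parity_test(r):
    """Distinguisher: accept iff the last bit is the parity of the rest."""
    return int(r[-1] == str(r[:-1].count("1") % 2))
```

The test accepts every output of `parity_extend` but only half of all 4-bit strings, so `advantage(parity_test, parity_extend, 3, 4)` is 0.5. This toy function therefore fails (20.7) for any ε ≤ 1/2 as soon as T is large enough to allow a parity circuit.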


Figure 20.6: A pseudorandom generator G maps
a short string s ∈ {0,1}^ℓ into a long string r ∈
{0,1}^m such that a small program/circuit P cannot
distinguish between the case that it is provided a
random input r ~ {0,1}^m and the case that it
is provided a “pseudorandom” input of the form
r = G(s) where s ~ {0,1}^ℓ. The short string s
is sometimes called the seed of the pseudorandom
generator, as it is a small object that can be thought of as
yielding a large “tree of randomness”.

We can think of a pseudorandom generator as a “randomness
amplifier.” It takes an input s of ℓ bits chosen at random and
expands these ℓ bits into an output r of m > ℓ pseudorandom bits. If
ε is small enough then the pseudorandom bits will “look random”
to any NAND-CIRC program that is not too big. Still, there are two
questions we haven’t answered:


• What reason do we have to believe that pseudorandom generators with
non-trivial parameters exist?

• Even if they do exist, why would such generators be useful to derandomize
randomized algorithms? After all, Definition 20.9 does not involve
RNAND-TM or RNAND-RAM programs, but rather deterministic
NAND-CIRC programs with no randomness and no loops.


We will now (partially) answer both questions. For the first
question, let us come clean and confess we do not know how to prove that
interesting pseudorandom generators exist. By interesting we mean
pseudorandom generators that satisfy that ε is some small constant
(say ε < 1/3), m > ℓ, and the function G itself can be computed in
poly(m) time. Nevertheless, Lemma 20.12 (whose statement and proof
are deferred to the end of this chapter) shows that if we only drop the
last condition (polynomial-time computability), then there do in fact
exist pseudorandom generators where m is exponentially larger than ℓ.


20.4.2 From existence to constructivity 


The fact that there exists a pseudorandom generator does not mean 
that there is one that can be efficiently computed. However, it turns 
out that we can turn complexity “on its head” and use the assumed 
non-existence of fast algorithms for problems such as 3SAT to obtain 
pseudorandom generators that can then be used to transform random- 
ized algorithms into deterministic ones. This is known as the Hardness 
vs Randomness paradigm. A number of results along those lines, most 
of which are outside the scope of this course, have led researchers to 
believe the following conjecture: 


Optimal PRG conjecture: There is a polynomial-time computable
function PRG : {0,1}* → {0,1} that yields an exponentially secure
pseudorandom generator.

Specifically, there exists a constant δ > 0 such that for every ℓ and
m < 2^{δℓ}, if we define G : {0,1}^ℓ → {0,1}^m as G(s)_i = PRG(s, i) for
every s ∈ {0,1}^ℓ and i ∈ [m], then G is a (2^{δℓ}, 2^{−δℓ}) pseudorandom
generator.




We emphasize again that the optimal PRG conjecture is, as its name
implies, a conjecture, and we still do not know how to prove it. In
particular, it is stronger than the conjecture that P ≠ NP. But we do have
some evidence for its truth. There is a spectrum of different types of
pseudorandom generators, and there are weaker assumptions than
the optimal PRG conjecture that suffice to prove that BPP = P. In
particular this is known to hold under the assumption that there exists
a function F ∈ TIME(2^{O(n)}) and ε > 0 such that for every sufficiently
large n, F↾n is not in SIZE(2^{εn}). The name “Optimal PRG conjecture”
is non-standard. This conjecture is sometimes known in the literature
as the existence of exponentially strong pseudorandom functions.³

3 A pseudorandom generator of the form we posit,
where each output bit can be computed individually
in time polynomial in the seed length, is commonly
known as a pseudorandom function generator. For more
on the many interesting results and connections in the
study of pseudorandomness, see this monograph of Salil
Vadhan.

20.4.3 Usefulness of pseudorandom generators
We now show that optimal pseudorandom generators are indeed very
useful, by proving the following theorem:


Theorem 20.10 — Derandomization of BPP. Suppose that the optimal 
PRG conjecture is true. Then BPP = P. 


Proof Idea: 
The optimal PRG conjecture tells us that we can achieve exponential
expansion of ℓ truly random coins into as many as 2^{δℓ} “pseudorandom
coins.” Looked at from the other direction, it allows us to reduce the
need for randomness by taking an algorithm that uses m coins and
converting it into an algorithm that only uses O(log m) coins. Now an
algorithm of the latter type can be made fully deterministic by
enumerating over all the 2^{O(log m)} (which is polynomial in m)
possibilities for its random choices.

* 


We now proceed with the proof details. 


Proof of Theorem 20.10. Let F ∈ BPP and let P be an RNAND-TM
program and c, d constants such that for every x ∈ {0,1}^n, P(x)
runs in at most c·n^d steps and Pr_{r~{0,1}^m}[P(x;r) = F(x)] ≥ 2/3,
where m is the number of coins that P uses on inputs of length n.
By “unrolling the loop” and hardwiring the input x, we can obtain
for every input x ∈ {0,1}^n a NAND-CIRC program Q_x of at most,
say, T = 10c·n^d lines, that takes m bits of input and such that
Q_x(r) = P(x;r).

Now suppose that G : {0,1}^ℓ → {0,1}^m is a (T, 0.1) pseudorandom
generator. Then we could deterministically estimate the probability
p(x) = Pr_{r~{0,1}^m}[Q_x(r) = 1] up to 0.1 accuracy in time
O(T·2^ℓ·m·cost(G)) where cost(G) is the time that it takes to compute
a single output bit of G.

The reason is that we know that p̃(x) = Pr_{s~{0,1}^ℓ}[Q_x(G(s)) = 1]
will give us such an estimate for p(x), and we can compute the
probability p̃(x) by simply trying all 2^ℓ possibilities for s. Now, under
the optimal PRG conjecture we can set T = 2^{δℓ} or equivalently ℓ =
(1/δ) log T, and our total computation time is polynomial in 2^ℓ = T^{1/δ}.
Since T ≤ 10c·n^d, this running time will be polynomial in n.

This completes the proof, since we are guaranteed that
Pr_{r~{0,1}^m}[Q_x(r) = F(x)] ≥ 2/3, and hence estimating the
probability p(x) to within 0.1 accuracy is sufficient to compute F(x).

■
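The seed-enumeration step can be sketched in Python as follows. The names are hypothetical: `Q` stands for the circuit Q_x with x already hardwired, modeled as a function from a bit string to a bit, and `G` stands for the generator. The sketch only shows the enumeration; it is correct for deciding F only when G really is a (T, 0.1) pseudorandom generator for circuits of the size of Q.

```python
from itertools import product

def derandomized_decide(Q, G, ell):
    """Estimate Pr_r[Q(r) = 1] by Pr_s[Q(G(s)) = 1] over all 2**ell seeds,
    then round: output 1 iff the estimate is at least 1/2."""
    accepted = sum(Q(G("".join(s))) for s in product("01", repeat=ell))
    estimate = accepted / 2 ** ell
    return int(estimate >= 0.5)
```

Since the true acceptance probability is at least 2/3 when F(x) = 1 and at most 1/3 when F(x) = 0, an estimate within 0.1 of it always falls on the correct side of 1/2. With ℓ = O(log n) the loop runs only polynomially many times.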


20.5 P = NP AND BPP VS P 

Two computational complexity questions that we cannot settle are: 
• Is P = NP? Where we believe the answer is negative.

• Is BPP = P? Where we believe the answer is positive.


However we can say that the “conventional wisdom” is correct on 
at least one of these questions. Namely, if we’re wrong on the first 
count, then we'll be right on the second one: 


Theorem 20.11 — Sipser–Gács Theorem. If P = NP then BPP = P.


Proof Idea: 


The construction follows the “quantifier elimination” idea which
we have seen in Theorem 16.6. We will show that for every F ∈ BPP,
we can reduce the question of whether some input x satisfies F(x) = 1
to the question of whether a formula of the form
∃_{u∈{0,1}^m} ∀_{v∈{0,1}^k} P(u, v)
is true, where m, k are polynomial in the length of x and P is
polynomial-time computable. By Theorem 16.6, if P = NP then we can
decide in polynomial time whether such a formula is true or false.

The idea behind this construction is that using amplification we
can obtain a randomized algorithm A for computing F using m coins
such that for every x ∈ {0,1}^n, if F(x) = 0 then the set S ⊆ {0,1}^m
of coins that make A output 1 is extremely tiny (i.e., exponentially
small relative to 2^m), and if F(x) = 1 then S is very large (of size
close to 2^m). We then consider “shifts” of the set S: sets of the form
S ⊕ s where s ∈ {0,1}^m is some string, where S ⊕ s is defined as
{r ⊕ s | r ∈ S}. Note that for every such shift s, the cardinality of S ⊕ s
is the same as the cardinality of S. Hence, if F(x) = 0, and so S is
“tiny”, then for every polynomial number of shifts s_0, ..., s_{k−1} ∈
{0,1}^m, the union of the sets S ⊕ s_i will not cover {0,1}^m. On the
other hand, we will show that if S is very large, then there exists a
polynomial number of such shifts such that
∪_{i∈[k]} (S ⊕ s_i) = {0,1}^m.

We can express the condition that there exist s_0, ..., s_{k−1} such that
∪_{i∈[k]} (S ⊕ s_i) = {0,1}^m as a statement with a constant number of
quantifiers. (Specifically, this condition holds if for every y ∈ {0,1}^m,
there exist s ∈ S and i ∈ {0, ..., k−1} such that y = s ⊕ s_i.)

* 


Figure 20.7: If $F \in \mathbf{BPP}$ then through amplification we
can ensure that there is an algorithm $A$ to compute $F$ on $n$-length
inputs and using $m$ coins such that
$\Pr_{r \in \{0,1\}^m}[A(xr) \neq F(x)] \ll 1/\mathrm{poly}(m)$. Hence
if $F(x) = 1$ then almost all of the $2^m$ choices for $r$ will cause
$A(xr)$ to output $1$, while if $F(x) = 0$ then $A(xr) = 0$ for almost all
$r$’s. To prove the Sipser-Gács Theorem we consider several “shifts” of
the set $S \subseteq \{0,1\}^m$ of the coins $r$ such that $A(xr) = 1$. If
$F(x) = 1$ then we can find a set of $k$ shifts $s_0, \ldots, s_{k-1}$
for which $\cup_{i \in [k]} (S \oplus s_i) = \{0,1\}^m$. If $F(x) = 0$ then
for every such set $|\cup_{i \in [k]} (S \oplus s_i)| \leq k|S| \ll 2^m$. We
can phrase the question of whether there is such a set of shifts using a
constant number of quantifiers, and so can solve it in polynomial time if
P = NP.

Proof of Theorem 20.11. Let $F \in \mathbf{BPP}$. Using Theorem 20.5, there
exists a polynomial-time algorithm $A$ such that for every $x \in \{0,1\}^n$,
$\Pr_{r \in \{0,1\}^m}[A(xr) = F(x)] \geq 1 - 2^{-n}$ where $m$ is polynomial
in $n$. In particular (since an exponential dominates a polynomial, and we
can always assume $n$ is sufficiently large), it holds that

$$\Pr_{r \in \{0,1\}^m}[A(xr) = F(x)] \geq 1 - \frac{1}{10m^2} \;. \tag{20.8}$$

Let $x \in \{0,1\}^n$, and let $S_x \subseteq \{0,1\}^m$ be the set
$\{ r \in \{0,1\}^m : A(xr) = 1 \}$. By our assumption, if $F(x) = 0$ then
$|S_x| \leq \frac{1}{10m^2} 2^m$ and if $F(x) = 1$ then
$|S_x| \geq (1 - \frac{1}{10m^2}) 2^m$.

For a set $S \subseteq \{0,1\}^m$ and a string $s \in \{0,1\}^m$, we define
the set $S \oplus s$ to be $\{ r \oplus s : r \in S \}$ where $\oplus$ denotes
the XOR operation. That is, $S \oplus s$ is the set $S$ “shifted” by $s$.
Note that $|S \oplus s| = |S|$. (Please make sure that you see why this is
true.)


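The heart of the proof is the following two claims:

CLAIM I: For every subset $S \subseteq \{0,1\}^m$, if
$|S| \leq \frac{1}{1000m} 2^m$ then for every
$s_0, \ldots, s_{100m-1} \in \{0,1\}^m$,
$\cup_{i \in [100m]} (S \oplus s_i) \subsetneq \{0,1\}^m$.

CLAIM II: For every subset $S \subseteq \{0,1\}^m$, if
$|S| \geq \frac{1}{2} 2^m$ then there exists a set of strings
$s_0, \ldots, s_{100m-1}$ such that
$\cup_{i \in [100m]} (S \oplus s_i) = \{0,1\}^m$.

CLAIM I and CLAIM II together imply the theorem. Indeed, for sufficiently
large $m$ the set $S_x$ satisfies the hypothesis of CLAIM I when $F(x) = 0$
(since $\frac{1}{10m^2} \leq \frac{1}{1000m}$) and the hypothesis of CLAIM II
when $F(x) = 1$. Hence the claims mean that under our assumptions, for every
$x \in \{0,1\}^n$, $F(x) = 1$ if and only if

$$\exists_{s_0, \ldots, s_{100m-1} \in \{0,1\}^m} \forall_{w \in \{0,1\}^m} \left( w \in (S_x \oplus s_0) \vee w \in (S_x \oplus s_1) \vee \cdots \vee w \in (S_x \oplus s_{100m-1}) \right)$$

or equivalently (since $w \in S_x \oplus s_i$ exactly when
$w \oplus s_i \in S_x$)

$$\exists_{s_0, \ldots, s_{100m-1} \in \{0,1\}^m} \forall_{w \in \{0,1\}^m} \left( A(x(w \oplus s_0)) = 1 \vee A(x(w \oplus s_1)) = 1 \vee \cdots \vee A(x(w \oplus s_{100m-1})) = 1 \right)$$

which (since $A$ is computable in polynomial time) is exactly the
type of statement shown in Theorem 16.6 to be decidable in polynomial
time if P = NP.

We see that all that is left is to prove CLAIM I and CLAIM II.
CLAIM I follows immediately from the fact that

$$\left| \cup_{i \in [100m]} (S \oplus s_i) \right| \leq \sum_{i=0}^{100m-1} |S \oplus s_i| = \sum_{i=0}^{100m-1} |S| = 100m|S| \;,$$

which is at most $2^m/10 < 2^m$ whenever $|S| \leq \frac{1}{1000m} 2^m$.

To prove CLAIM II, we will use a technique known as the probabilistic
method (see the proof of Lemma 20.12 for a more extensive
discussion). Note that this is a completely different use of probability
than in the theorem statement: we just use the methods of probability
to prove an existential statement.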

Proof of CLAIM II: Let $S \subseteq \{0,1\}^m$ with $|S| \geq 0.5 \cdot 2^m$
be as in the claim’s statement. Consider the following probabilistic
experiment: we choose $100m$ random shifts $s_0, \ldots, s_{100m-1}$
independently at random in $\{0,1\}^m$, and consider the event GOOD that
$\cup_{i \in [100m]} (S \oplus s_i) = \{0,1\}^m$. To prove CLAIM II it is
enough to show that $\Pr[\text{GOOD}] > 0$, since that means that in
particular there must exist shifts $s_0, \ldots, s_{100m-1}$ that satisfy
this condition.

For every $z \in \{0,1\}^m$, define the event $\text{BAD}_z$ to hold if
$z \notin \cup_{i \in [100m]} (S \oplus s_i)$. The event GOOD holds if
$\text{BAD}_z$ fails for every $z \in \{0,1\}^m$, and so our goal is to prove
that $\Pr[\cup_{z \in \{0,1\}^m} \text{BAD}_z] < 1$. By the union bound, to
show this, it is enough to show that $\Pr[\text{BAD}_z] < 2^{-m}$ for every
$z \in \{0,1\}^m$. Define the event $\text{BAD}_z^i$ to hold if
$z \notin S \oplus s_i$. Since every shift $s_i$ is chosen independently, for
every fixed $z$ the events $\text{BAD}_z^0, \ldots, \text{BAD}_z^{100m-1}$
are mutually independent,⁴ and hence

$$\Pr[\text{BAD}_z] = \Pr[\cap_{i \in [100m]} \text{BAD}_z^i] = \prod_{i=0}^{100m-1} \Pr[\text{BAD}_z^i] \;. \tag{20.9}$$

So this means that the result will follow by showing that
$\Pr[\text{BAD}_z^i] \leq \frac{1}{2}$ for every $z \in \{0,1\}^m$ and
$i \in [100m]$ (as that would allow us to bound the right-hand side of (20.9)
by $2^{-100m}$). In other words, we need to show that for every
$z \in \{0,1\}^m$ and set $S \subseteq \{0,1\}^m$ with
$|S| \geq \frac{1}{2} 2^m$,

$$\Pr_{s \sim \{0,1\}^m}[z \in S \oplus s] \geq \frac{1}{2} \;. \tag{20.10}$$

To show this, we observe that $z \in S \oplus s$ if and only if
$s \in S \oplus z$ (can you see why?). Hence we can rewrite the probability
on the left-hand side of (20.10) as $\Pr_{s \sim \{0,1\}^m}[s \in S \oplus z]$,
which simply equals $|S \oplus z| / 2^m = |S| / 2^m \geq 1/2$. This concludes
the proof of CLAIM II and hence of Theorem 20.11.

∎
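The shifting argument can be tried out by brute force for a small number of
coins. The following is only an illustrative sketch: the function names, the
encoding of $m$-bit strings as integers, and the toy parameters ($m = 8$, a
random set of size $2^m/2$, and a two-element "tiny" set) are assumptions made
for the example and are not the parameters required by the proof.

```python
import random

def shift(S, s):
    """Return S XOR s = {r ^ s : r in S}; m-bit strings are encoded as ints."""
    return {r ^ s for r in S}

def covers(S, shifts, m):
    """Check whether the union of the shifted copies of S is all of {0,1}^m."""
    union = set()
    for s in shifts:
        union |= shift(S, s)
    return len(union) == 2 ** m

def find_covering_shifts(S, m, k, attempts=1000, seed=0):
    """Sample k random shifts until they cover {0,1}^m; None if none is found."""
    rng = random.Random(seed)
    for _ in range(attempts):
        shifts = [rng.randrange(2 ** m) for _ in range(k)]
        if covers(S, shifts, m):
            return shifts
    return None

m = 8
rng = random.Random(1)
large = set(rng.sample(range(2 ** m), 2 ** (m - 1)))  # |S| = 2^m / 2
tiny = {3, 200}                                       # |S| = 2

# CLAIM II flavor: a set of half the strings is covered by 100m random shifts.
shifts = find_covering_shifts(large, m, k=100 * m)

# CLAIM I flavor: 100 shifts of a 2-element set reach at most 200 < 2^8 strings.
few_shifts = list(range(100))
tiny_is_covered = covers(tiny, few_shifts, m)
```

Shifting never changes the size of a set, which is why the counting bound of
CLAIM I applies to `tiny` no matter which shifts are chosen.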


20.6 NON-CONSTRUCTIVE EXISTENCE OF PSEUDORANDOM GENERATORS (ADVANCED, OPTIONAL)


We now show that, if we don’t insist on constructivity of pseudorandom
generators, then there exist pseudorandom generators whose output
length is exponentially larger than their input length.


Lemma 20.12 — Existence of inefficient pseudorandom generators. There is 
some absolute constant $C$ such that for every $\epsilon, T$, if
$\ell > C(\log T + \log(1/\epsilon))$ and $m \leq T$, then there is a
$(T, \epsilon)$ pseudorandom generator $G : \{0,1\}^\ell \to \{0,1\}^m$.
⁴ The condition of independence here is subtle. It is not the case that
all of the $2^m \times 100m$ events
$\{ \text{BAD}_z^i \}_{z \in \{0,1\}^m, i \in [100m]}$ are mutually
independent. Only for a fixed $z \in \{0,1\}^m$, the $100m$ events of the
form $\text{BAD}_z^i$ are mutually independent.
Proof Idea:

The proof uses an extremely useful technique known as the “prob- 
abilistic method” which is not too hard mathematically but can be 
confusing at first. The idea is to give a “non-constructive” proof of 
existence of the pseudorandom generator G by showing that if G was 
chosen at random, then the probability that it would be a valid (T, €) 
pseudorandom generator is positive. In particular this means that 
there exists a single G that is a valid (T, €) pseudorandom generator. 
The probabilistic method is just a proof technique to demonstrate the 
existence of such a function. Ultimately, our goal is to show the exis- 
tence of a deterministic function G that satisfies the condition. 

* 


The above discussion might be rather abstract at this point, but
should become clearer after seeing the proof.


Proof of Lemma 20.12. Let $\epsilon, T, \ell, m$ be as in the lemma’s
statement. We need to show that there exists a function
$G : \{0,1\}^\ell \to \{0,1\}^m$ that “fools” every $T$-line program $P$ in
the sense of (20.7). We will show that this follows from the following claim:

Claim I: For every fixed NAND-CIRC program $P$, if we pick
$G : \{0,1\}^\ell \to \{0,1\}^m$ at random then the probability that (20.7)
is violated is at most $2^{-T^2}$.

Before proving Claim I, let us see why it implies Lemma 20.12. We
can identify a function $G : \{0,1\}^\ell \to \{0,1\}^m$ with its “truth
table” or simply the list of evaluations on all its possible $2^\ell$ inputs.
Since each output is an $m$ bit string, we can also think of $G$ as a string
in $\{0,1\}^{m \cdot 2^\ell}$. We define $\mathcal{F}^m_\ell$ to be the set of
all functions from $\{0,1\}^\ell$ to $\{0,1\}^m$. As discussed above we can
identify $\mathcal{F}^m_\ell$ with $\{0,1\}^{m \cdot 2^\ell}$ and
choosing a random function $G \sim \mathcal{F}^m_\ell$ corresponds to choosing
a random $m \cdot 2^\ell$-long bit string.

For every NAND-CIRC program $P$ let $B_P$ be the event that, if we
choose $G$ at random from $\mathcal{F}^m_\ell$ then (20.7) is violated with
respect to the program $P$. It is important to understand what is the sample
space that the event $B_P$ is defined over, namely this event depends on the
choice of $G$ and so $B_P$ is a subset of $\mathcal{F}^m_\ell$. An equivalent
way to define the event $B_P$ is that it is the subset of all functions
mapping $\{0,1\}^\ell$ to $\{0,1\}^m$ that violate (20.7), or in other words:

$$B_P = \left\{ G \in \mathcal{F}^m_\ell \;:\; \left| \frac{1}{2^\ell} \sum_{s \in \{0,1\}^\ell} P(G(s)) - \frac{1}{2^m} \sum_{r \in \{0,1\}^m} P(r) \right| > \epsilon \right\} \;. \tag{20.11}$$


⁵ There is a whole (highly recommended) book by
Alon and Spencer devoted to this method.




(We've replaced here the probability statements in (20.7) with the 
equivalent sums so as to reduce confusion as to what is the sample 
space that Bp is defined over.) 

To understand this proof it is crucial that you pause here and see 
how the definition of Bp above corresponds to (20.11). This may well 
take re-reading the above text once or twice, but it is a good exercise 
at parsing probabilistic statements and learning how to identify the 
sample space that these statements correspond to. 

Now, we’ve shown in Theorem 5.2 that up to renaming variables
(which makes no difference to the program’s functionality) there are
$2^{O(T \log T)}$ NAND-CIRC programs of at most $T$ lines. Since
$T \log T < T^2$ for sufficiently large $T$, this means that if Claim I is
true, then by the union bound it holds that the probability of the union of
$B_P$ over all NAND-CIRC programs of at most $T$ lines is at most
$2^{O(T \log T)} 2^{-T^2} < 0.1$ for sufficiently large $T$. What is
important for us about the number $0.1$ is that it is smaller than $1$. In
particular this means that there exists a single
$G^* \in \mathcal{F}^m_\ell$ such that $G^*$ does not violate
(20.7) with respect to any NAND-CIRC program of at most $T$ lines,
but that precisely means that $G^*$ is a $(T, \epsilon)$ pseudorandom
generator.

Hence to conclude the proof of Lemma 20.12, it suffices to prove
Claim I. Choosing a random $G : \{0,1\}^\ell \to \{0,1\}^m$ amounts to
choosing $L = 2^\ell$ random strings $y_0, \ldots, y_{L-1} \in \{0,1\}^m$ and
letting $G(x) = y_x$ (identifying $\{0,1\}^\ell$ and $[L]$ via the binary
representation). This means that proving the claim amounts to showing that
for every fixed function $P : \{0,1\}^m \to \{0,1\}$, if
$L > 2^{C(\log T + \log(1/\epsilon))}$ (which by setting $C > 4$,
we can ensure is larger than $10T^2/\epsilon^2$) then the probability that

$$\left| \frac{1}{L} \sum_{i=0}^{L-1} P(y_i) - \Pr_{s \sim \{0,1\}^m}[P(s) = 1] \right| > \epsilon \tag{20.12}$$

is at most $2^{-T^2}$.

The bound on (20.12) follows directly from the Chernoff bound. Indeed, if we
let for every $i \in [L]$ the random variable $X_i$ denote $P(y_i)$, then
since $y_0, \ldots, y_{L-1}$ are chosen independently at random, these are
independently and identically distributed random variables with mean
$\mathbb{E}_{y \sim \{0,1\}^m}[P(y)] = \Pr_{y \sim \{0,1\}^m}[P(y) = 1]$ and
hence the probability that their average deviates from its expectation by
more than $\epsilon$ is at most $2 \cdot 2^{-\epsilon^2 L/2}$. Since
$L > 10T^2/\epsilon^2$, this is at most $2 \cdot 2^{-5T^2} \leq 2^{-T^2}$.

∎
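The concentration step used for Claim I can also be observed empirically.
The sketch below is only an illustration under assumed toy parameters: the
test function `P`, the values of `m`, `eps` and `L`, and the number of trials
are all hypothetical choices, not quantities taken from the lemma. It draws
$L$ random strings, compares the average of $P$ over them with the exact
acceptance probability, and counts how often the gap exceeds $\epsilon$.

```python
import random

def empirical_gap(P, m, L, rng):
    """|average of P over L random m-bit strings - exact Pr[P(y) = 1]|."""
    exact = sum(P(y) for y in range(2 ** m)) / 2 ** m
    sample = [rng.randrange(2 ** m) for _ in range(L)]
    average = sum(P(y) for y in sample) / L
    return abs(average - exact)

def P(y):
    """Hypothetical test: accept a 10-bit string with at least 6 ones."""
    return 1 if bin(y).count("1") >= 6 else 0

m, eps, L = 10, 0.1, 4000
rng = random.Random(0)
trials = 200

# Number of trials in which the sample average is off by more than eps.
violations = sum(empirical_gap(P, m, L, rng) > eps for _ in range(trials))

# The bound from the proof: 2 * 2^(-eps^2 * L / 2).
chernoff_bound = 2 * 2 ** (-(eps ** 2) * L / 2)
```

With these parameters the bound is below one in a hundred thousand, so seeing
a violation in 200 trials would be extremely unlikely.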


Chapter Recap


• We can model randomized algorithms by either
adding a special “coin toss” operation or assuming
an extra randomly chosen input.
[Figure 20.8 image: two panels, labeled “Possible” and “Likely”, each
showing the computable functions and HALT within the set of all functions
$F : \{0,1\}^* \to \{0,1\}$.]


20.7 EXERCISES 


20.8 BIBLIOGRAPHICAL NOTES 


Figure 20.8: The relation between BPP and the other
complexity classes that we have seen. We know that
$\mathbf{P} \subseteq \mathbf{BPP} \subseteq \mathbf{EXP}$ and
$\mathbf{BPP} \subseteq \mathbf{P}_{/poly}$ but we don’t
know how BPP compares with NP and can’t rule out
even BPP = EXP. Most evidence points to the
possibility that BPP = P.

In this chapter we ignore the issue of how we actually get random
bits in practice. The output of many physical processes, whether it
is thermal heat, network and hard drive latency, user typing pattern
and mouse movements, and more can be thought of as a binary
string sampled from some distribution $\mu$ that might have significant
unpredictability (or entropy) but is not necessarily the uniform
distribution over $\{0,1\}^n$. Indeed, as this paper shows, even (real-world)
coin tosses do not have exactly the distribution of a uniformly random
string. Therefore, to use the resulting measurements for randomized
algorithms, one typically needs to apply a “distillation” or randomness
extraction process to the raw measurements to transform them
to the uniform distribution. Vadhan’s book [Vad+12] is an excellent
source for more discussion on both randomness extractors and
pseudorandom generators.


The name BPP stands for “bounded-error probabilistic polynomial time”.
This is a historical accident: this class probably should have been
called RP or PP but both names were taken by other classes.


The proof of Theorem 20.8 actually yields more than its statement.
We can use the same “unrolling the loop” arguments we’ve used before
to show that the restriction to $\{0,1\}^n$ of every function in BPP
is also computable by a polynomial-size RNAND-CIRC program
(i.e., NAND-CIRC program with the RAND operation). Like in the P
vs SIZE(poly(n)) case, there are also functions outside BPP whose
restrictions can be computed by polynomial-size RNAND-CIRC programs.
Nevertheless the proof of Theorem 20.8 shows that even such
functions can be computed by polynomial-sized NAND-CIRC programs
without using the RAND operation. This can be phrased as
saying that $BPSIZE(T(n)) \subseteq SIZE(O(nT(n)))$ (where BPSIZE is
defined in the natural way using RNAND-CIRC programs). The stronger
version of Theorem 20.8 we mentioned can be phrased as saying that
$\mathbf{BPP}_{/poly} = \mathbf{P}_{/poly}$.
ADVANCED TOPICS


21 Cryptography


Learning Objectives:

• Definition of perfect secrecy
• The one-time pad encryption scheme
• Necessity of long keys for perfect secrecy
• Computational secrecy and the derandomized one-time pad
• Public key encryption
• A taste of advanced topics


“Human ingenuity cannot concoct a cipher which human ingenuity cannot
resolve.”, Edgar Allan Poe, 1841


“A good disguise should not reveal the person’s height”, Shafi Goldwasser
and Silvio Micali, 1982


“‘Perfect Secrecy’ is defined by requiring of a system that after a cryptogram
is intercepted by the enemy the a posteriori probabilities of this cryptogram
representing various messages be identically the same as the a priori
probabilities of the same messages before the interception. It is shown that
perfect secrecy is possible but requires, if the number of messages is finite,
the same number of possible keys.”, Claude Shannon, 1945


“We stand today on the brink of a revolution in cryptography.”, Whitfield
Diffie and Martin Hellman, 1976


Cryptography - the art or science of “secret writing” - has been 
around for several millennia, and for almost all of that time Edgar 
Allan Poe’s quote above held true. Indeed, the history of cryptography 
is littered with the figurative corpses of cryptosystems believed secure 
and then broken, and sometimes with the actual corpses of those who 
have mistakenly placed their faith in these cryptosystems. 

Yet, something changed in the last few decades, which is the “revolution”
alluded to (and to a large extent initiated by) Diffie and Hellman’s
1976 paper quoted above. New cryptosystems have been found
that have not been broken despite being subjected to immense efforts
involving both human ingenuity and computational power on a scale
that completely dwarfs the “code breakers” of Poe’s time. Even more
amazingly, these cryptosystems are not only seemingly unbreakable,
but they also achieve this under much harsher conditions. Not only
do today’s attackers have more computational power, but they also
have more data to work with. In Poe’s age, an attacker would be lucky
if they got access to more than a few encryptions of known messages.
These days attackers might have massive amounts of data (terabytes
or more) at their disposal. In fact, with public key encryption, an
attacker can generate as many ciphertexts as they wish.

The key to this success has been a clearer understanding of both 
how to define security for cryptographic tools and how to relate this 
security to concrete computational problems. Cryptography is a vast 
and continuously changing topic, but we will touch on some of these 
issues in this chapter. 


21.1 CLASSICAL CRYPTOSYSTEMS 


A great many cryptosystems have been devised and broken throughout
the ages. Let us recount just some of these stories. In 1587, Mary,
Queen of Scots, and the heir to the throne of England, wanted to
arrange the assassination of her cousin, Queen Elizabeth I of England,
so that she could ascend to the throne and finally escape the house
arrest under which she had been for the last 18 years. As part of this
complicated plot, she sent a coded letter to Sir Anthony Babington.
Mary used what’s known as a substitution cipher where each letter
is transformed into a different obscure symbol (see Fig. 21.1). At a
first look, such a letter might seem rather inscrutable, a meaningless
sequence of strange symbols. However, after some thought, one might
recognize that these symbols repeat several times and moreover that
different symbols repeat with different frequencies. Now it doesn’t
take a large leap of faith to assume that perhaps each symbol
corresponds to a different letter and the more frequent symbols correspond
to letters that occur in the language with higher frequency. From this
observation, there is a short gap to completely breaking the cipher,
which was in fact done by Queen Elizabeth’s spies, who used the
decoded letters to learn of all the co-conspirators and to convict Queen
Mary of treason, a crime for which she was executed. Trusting in
superficial security measures (such as using “inscrutable” symbols) is a
trap that users of cryptography have been falling into again and again
over the years. (As with many things, this is the subject of a great
XKCD cartoon, see Fig. 21.2.)

Figure 21.1: Snippet from encrypted communication
between Queen Mary and Sir Babington
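The first step of the frequency attack described above is simply counting
symbols. The sketch below illustrates it on a made-up example: the
substitution table (a seeded random permutation of the lowercase alphabet),
the sample plaintext, and all function names are assumptions chosen for
illustration, and have nothing to do with Mary’s actual cipher.

```python
import random
import string
from collections import Counter

def substitute(text, table):
    """Replace each letter by its image under table; other characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)

def symbol_frequencies(ciphertext):
    """Return (symbol, count) pairs for alphabetic symbols, most frequent first."""
    counts = Counter(ch for ch in ciphertext if ch.isalpha())
    return counts.most_common()

# Hypothetical secret key: a fixed permutation of the alphabet.
alphabet = string.ascii_lowercase
permuted = list(alphabet)
random.Random(0).shuffle(permuted)
table = dict(zip(alphabet, permuted))

plaintext = "meet me at the gate when the bell tolls three times"
ciphertext = substitute(plaintext, table)

# The most frequent ciphertext symbol is a natural guess for the image of
# the most frequent plaintext letter (here "e").
top_symbol, top_count = symbol_frequencies(ciphertext)[0]
```

On a real intercepted text the attacker would compare the whole frequency
ranking against the known letter frequencies of the language, and then refine
the guesses using common words.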

The Vigenère cipher is named after Blaise de Vigenère, who described
it in a book in 1586 (though it was invented earlier by Bellaso).
The idea is to use a collection of substitution ciphers: if there are $n$
different ciphers then the first letter of the plaintext is encoded with
the first cipher, the second with the second cipher, the $n$-th with the
$n$-th cipher, and then the $(n+1)$-st letter is again encoded with the first
cipher. The key $k$ is usually a word or a phrase of $n$ letters. The $i$-th
substitution cipher will shift each letter by the same shift needed to get
from A to $k_i$. If $k_i$ is C, for example, the $i$-th substitution cipher
will shift every letter by two places. This “flattens” the frequencies and
makes it much harder to do frequency analysis, which is why this cipher was
considered “unbreakable” for 300+ years and got the nickname “le chiffre
indéchiffrable” (“the unbreakable cipher”). Nevertheless, Charles
Babbage cracked the Vigenère cipher in 1854 (though he did not publish
it). In 1863 Friedrich Kasiski broke the cipher and published the
result. The idea is that once you guess the length of the key, you
can reduce the task to breaking a simple substitution cipher, which can
be done via frequency analysis (can you see why?). Confederate generals
used Vigenère regularly during the Civil War, and their messages
were routinely cryptanalyzed by Union officers.
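The description above translates almost directly into code. The following is
a minimal sketch, assuming that both the text and the key consist only of the
uppercase letters A to Z; the function name `vigenere` and the sample message
and key are illustrative choices.

```python
import string

ALPHABET = string.ascii_uppercase

def vigenere(text, key, decrypt=False):
    """Shift the j-th letter of text by the shift from 'A' to key[j mod n]."""
    out = []
    for j, ch in enumerate(text):
        shift = ALPHABET.index(key[j % len(key)])
        if decrypt:
            shift = -shift
        out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
    return "".join(out)

ciphertext = vigenere("ATTACKATDAWN", "LEMON")
recovered = vigenere(ciphertext, "LEMON", decrypt=True)
```

Letters whose positions differ by a multiple of the key length are shifted by
the same amount, which is why guessing the key length reduces the problem to
several independent single-shift ciphers.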

The Enigma cipher was a mechanical cipher (looking like a typewriter,
see Fig. 21.5) where each letter typed would get mapped into
a different letter depending on the (rather complicated) key and current
state of the machine, which had several rotors that rotated at
different paces. An identically wired machine at the other end could
be used to decrypt. As with many ciphers in history, the Germans
believed this one to be “impossible to break”, and even quite
late in the war they refused to believe it was broken despite mounting
evidence to that effect. (In fact, some German generals refused
to believe it was broken even after the war.) Breaking Enigma was a
heroic effort which was initiated by the Poles and then completed by
the British at Bletchley Park, with Alan Turing (of the Turing machine)
playing a key role. As part of this effort the Brits built arguably the
world’s first large scale mechanical computation devices (though they
looked more similar to washing machines than to iPhones). They were
also helped along the way by some quirks and errors of the German
operators. For example, the fact that their messages ended with “Heil
Hitler” turned out to be quite useful.

Figure 21.2: XKCD’s take on the added security of
using uncommon symbols

Figure 21.3: Confederate Cipher Disk for implementing
the Vigenère cipher

Figure 21.4: Confederate encryption of the message
“Gen’l Pemberton: You can expect no help from this
side of the river. Let Gen’l Johnston know, if possible,
when you can attack the same point on the enemy’s
lines. Inform me also and I will endeavor to make a
diversion. I have sent some caps. I subjoin a despatch
from General Johnston.”

Here is one entertaining anecdote: the Enigma machine would
never map a letter to itself. In March 1941, Mavis Batey, a cryptanalyst
at Bletchley Park, received a very long message that she tried to
decrypt. She then noticed a curious property: the message did not
contain the letter “L”.¹ She realized that the probability that no “L”s
appeared in the message is too small for this to happen by chance.
Hence she surmised that the original message must have been composed
only of L’s. That is, it must have been the case that the operator,
perhaps to test the machine, had simply sent out a message where he
repeatedly pressed the letter “L”. This observation helped her decode
the next message, which helped inform of a planned Italian attack and
secure a resounding British victory in what became known as “the
Battle of Cape Matapan”. Mavis also helped break another Enigma
machine. Using the information she provided, the Brits were able
to feed the Germans with the false information that the main Allied
invasion would take place in Pas de Calais rather than in Normandy.

In the words of General Eisenhower, the intelligence from Bletchley 
Park was of “priceless value”. It made a huge difference for the Allied 
war effort, thereby shortening World War II and saving millions of 
lives. See also this interview with Sir Harry Hinsley. 


21.2 DEFINING ENCRYPTION 


Many of the troubles that cryptosystem designers faced over history
(and still face!) can be attributed to not properly defining or
understanding the goals they want to achieve in the first place. Let us focus
on the setting of private key encryption. (This is also known as
“symmetric encryption”; for thousands of years, “private key encryption”
was synonymous with encryption and only in the 1970s was the concept
of public key encryption invented, see Definition 21.11.) A sender
(traditionally called “Alice”) wants to send a message (known also
as a plaintext) $x \in \{0,1\}^*$ to a receiver (traditionally called “Bob”).
They would like their message to be kept secret from an adversary
who listens in or “eavesdrops” on the communication channel (and is
traditionally called “Eve”).

Figure 21.5: In the Enigma mechanical cipher the secret
key would be the settings of the rotors and internal
wires. As the operator typed up their message, the
encrypted version appeared in the display area above,
and the internal state of the cipher was updated (and
so typing the same letter twice would generally result
in two different letters output). Decrypting follows
the same process: if the sender and receiver are using
the same key then typing the ciphertext would result
in the plaintext appearing in the display.

¹ Here is a nice exercise: compute (up to an order
of magnitude) the probability that a 50-letter long
message composed of random letters will end up not
containing the letter “L”.

Alice and Bob share a secret key $k \in \{0,1\}^*$. (While the letter $k$
is often used elsewhere in the book to denote a natural number, in
this chapter we use it to denote the string corresponding to a secret
key.) Alice uses the key $k$ to “scramble” or encrypt the plaintext $x$ into
a ciphertext $y$, and Bob uses the key $k$ to “unscramble” or decrypt the
ciphertext $y$ back into the plaintext $x$. This motivates the following
definition which attempts to capture what it means for an encryption
scheme to be valid or “make sense”, regardless of whether or not it is
secure:


Definition 21.1 — Valid encryption scheme. Let $L : \mathbb{N} \to \mathbb{N}$
and $C : \mathbb{N} \to \mathbb{N}$ be two functions mapping natural numbers
to natural numbers. A pair of polynomial-time computable functions $(E, D)$
mapping strings to strings is a valid private key encryption scheme (or
encryption scheme for short) with plaintext length function $L(\cdot)$ and
ciphertext length function $C(\cdot)$ if for every $n \in \mathbb{N}$,
$k \in \{0,1\}^n$ and $x \in \{0,1\}^{L(n)}$, $|E_k(x)| = C(n)$ and

$$D(k, E(k, x)) = x \;. \tag{21.1}$$

We will often write the first input (i.e., the key) to the encryption
and decryption as a subscript and so can write (21.1) also as
$D_k(E_k(x)) = x$.
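For small key lengths the validity condition can be checked exhaustively.
The sketch below is an illustration only: the XOR-based pair
`xor_encrypt`/`xor_decrypt` with $L(n) = C(n) = n$, the key-ignoring
`identity` scheme, and the deliberately wrong `broken_decrypt` are all
hypothetical examples, and bit strings are represented as tuples of 0/1.
Note that validity says nothing about security.

```python
from itertools import product

def xor_encrypt(k, x):
    """Hypothetical scheme with L(n) = C(n) = n: bitwise XOR of key and plaintext."""
    return tuple(a ^ b for a, b in zip(k, x))

def xor_decrypt(k, y):
    """XOR with the same key undoes the encryption."""
    return tuple(a ^ b for a, b in zip(k, y))

def identity(k, x):
    """Trivial scheme that ignores the key: valid, but clearly not secure."""
    return x

def broken_decrypt(k, y):
    """A decryption that ignores the key, so it does not invert xor_encrypt."""
    return y

def is_valid(E, D, n, L, C):
    """Brute-force check of the validity condition for one key length n."""
    for k in product((0, 1), repeat=n):
        for x in product((0, 1), repeat=L(n)):
            y = E(k, x)
            if len(y) != C(n) or D(k, y) != x:
                return False
    return True

same_length = lambda n: n  # L(n) = C(n) = n for the schemes above
```

For example, `is_valid(xor_encrypt, xor_decrypt, 4, same_length, same_length)`
examines all $2^4 \cdot 2^4$ key and plaintext pairs.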


Solved Exercise 21.1 — Lengths of ciphertext and plaintext. Prove that for 
every valid encryption scheme $(E, D)$ with functions $L, C$, it holds that
$C(n) \geq L(n)$ for every $n$.


Solution:

For every fixed key $k \in \{0,1\}^n$, the equation (21.1) implies that
the map $y \mapsto D_k(y)$ inverts the map $x \mapsto E_k(x)$, which in
particular means that the map $x \mapsto E_k(x)$ must be one to one. Hence
its codomain must be at least as large as its domain, and since its
domain is $\{0,1\}^{L(n)}$ and its codomain is $\{0,1\}^{C(n)}$ it follows
that $C(n) \geq L(n)$.


Since the ciphertext length is always at least the plaintext length 
(and in most applications it is not much longer than that), we typi- 
cally focus on the plaintext length as the quantity to optimize in an 
encryption scheme. The larger L(n) is, the better the scheme, since it 
means we need a shorter secret key to protect messages of the same 
length. 


21.3 DEFINING SECURITY OF ENCRYPTION 


Definition 21.1 says nothing about the security of $E$ and $D$, and even
allows the trivial encryption scheme that ignores the key altogether
and sets $E_k(x) = x$ for every $x$. Defining security is not a trivial
matter.
Figure 21.6: A private-key encryption scheme is a
pair of algorithms $E, D$ such that for every key
$k \in \{0,1\}^n$ and plaintext $x \in \{0,1\}^{L(n)}$, $y = E_k(x)$
is a ciphertext of length $C(n)$. The encryption scheme
is valid if for every such $y$, $D_k(y) = x$. That is, the
decryption of an encryption of $x$ is $x$, as long as both
encryption and decryption use the same key.
Throughout history, many attacks on cryptosystems were rooted
in the cryptosystem designers’ reliance on “security through
obscurity”: trusting that the fact their methods are not known to
their enemy will protect them from being broken. This is a faulty
assumption: if you reuse a method again and again (even with a
different key each time) then eventually your adversaries will figure
out what you are doing. And if Alice and Bob meet frequently in a
secure location to decide on a new method, they might as well take
the opportunity to exchange their secrets. These considerations led
Auguste Kerckhoffs in 1883 to state the following principle:


A cryptosystem should be secure even if everything about the system, except
the key, is public knowledge.²

² The actual quote is “Il faut qu’il n’exige pas le secret, et qu’il puisse
sans inconvénient tomber entre les mains de l’ennemi”, loosely translated as
“The system must not require secrecy and can be stolen by the enemy without
causing trouble”. According to Steve Bellovin the NSA version is “assume
that the first copy of any device we make is shipped to the Kremlin”.

Why is it OK to assume the key is secret and not the algorithm?
Because we can always choose a fresh key. But of course that won’t
help us much if our key is “1234” or “passw0rd!”. In fact, if you use
any deterministic algorithm to choose the key then eventually your
adversary will figure this out. Therefore for security we must choose
the key at random and can restate Kerckhoffs’s principle as follows:


There is no secrecy without randomness 


This is such a crucial point that it is worth repeating:


At the heart of every cryptographic scheme there is a secret key, 
and the secret key is always chosen at random. A corollary of that 
is that to understand cryptography, you need to know probability 
theory. 


uniformly random string. Great care must be taken in 
doing this, and randomness generators often turn out 
to be the Achilles heel of secure systems. 

In 2006 a programmer removed a line of code from the
procedure to generate entropy in the OpenSSL package
distributed by Debian since it caused a warning in
some automatic verification code. As a result for two
years (until this was discovered) all the randomness
generated by this procedure used only the process
ID as an “unpredictable” source. This means that all
communication done by users in that period is fairly
easily breakable (and in particular, if some entities
recorded that communication they could break it also
retroactively). See XKCD’s take on that incident.

In 2012 two separate teams of researchers scanned a
large number of RSA keys on the web and found out
that about 4 percent of them are easy to break. The
main issue was devices such as routers, internet-connected
printers and such. These devices sometimes
run variants of Linux (a desktop operating system)
but without a hard drive, mouse or keyboard, they
don’t have access to many of the entropy sources that
desktops have. Coupled with some good old fashioned
ignorance of cryptography and software bugs,
this led to many keys that are downright trivial to
break, see this blog post and this web page for more
details.

Since randomness is so crucial to security, breaking
the procedure to generate randomness can lead to a
complete break of the system that uses this randomness.
Indeed, the Snowden documents, combined with
observations of Shumow and Ferguson, strongly suggest
that the NSA has deliberately inserted a trapdoor
in one of the pseudorandom generators published by
the National Institute of Standards and Technology
(NIST). Fortunately, this generator wasn’t widely
adopted, but apparently the NSA did pay 10 million
dollars to RSA Security so the latter would make this
generator the default option in their products.
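To see why a key derived only from a process ID offers little protection,
consider the following toy model. It is a sketch under stated assumptions and
not the actual OpenSSL code: `weak_key` is a hypothetical key generator that
seeds Python’s (non-cryptographic) `random.Random` with the process ID, and
`MAX_PID = 32768` is an assumed upper bound on process IDs. An attacker who
knows the method can simply try every seed.

```python
import random
import secrets

MAX_PID = 32768  # assumed upper bound on process IDs, for illustration only

def weak_key(pid):
    """A 128-bit key whose only source of 'randomness' is the process ID."""
    return random.Random(pid).getrandbits(128)

def recover_pid(key):
    """Brute-force search over every possible process ID."""
    for pid in range(MAX_PID):
        if weak_key(pid) == key:
            return pid
    return None

def strong_key():
    """A 128-bit key drawn from the operating system's randomness source."""
    return secrets.randbits(128)

victim_key = weak_key(4242)
found = recover_pid(victim_key)
```

Although the weak key is 128 bits long, it takes only one of 32768 possible
values, so the search above finishes almost instantly. A key produced by
`strong_key` gives the attacker no such shortcut.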


21.4 PERFECT SECRECY 


If you think about encryption scheme security for a while, you might
come up with the following principle for defining security: “An
encryption scheme is secure if it is not possible to recover the key $k$ from
$E_k(x)$”. However, a moment’s thought shows that the key is not really
what we’re trying to protect. After all, the whole point of an encryption
is to protect the confidentiality of the plaintext $x$. So, we can try to
define that “an encryption scheme is secure if it is not possible to recover
the plaintext $x$ from $E_k(x)$”. Yet it is not clear what this means either.
Suppose that an encryption scheme reveals the first 10 bits of the plaintext
$x$. It might still not be possible to recover $x$ completely, but on an
intuitive level, it seems extremely unwise to use such
an encryption scheme in practice. Indeed, often even partial information
about the plaintext is enough for the adversary to achieve its goals.

The above thinking led Shannon in 1945 to formalize the notion of 
perfect secrecy, which is that an encryption reveals absolutely nothing 
about the message. There are several equivalent ways to define it, but 
perhaps the cleanest one is the following: 


Definition 21.3 — Perfect secrecy. A valid encryption scheme $(E, D)$
with plaintext length $L(\cdot)$ is perfectly secret if for every $n \in \mathbb{N}$ and
plaintexts $x, x' \in \{0,1\}^{L(n)}$, the following two distributions $Y$ and
$Y'$ over $\{0,1\}^*$ are identical:


• $Y$ is obtained by sampling $k \sim \{0,1\}^n$ and outputting $E_k(x)$.


• $Y'$ is obtained by sampling $k \sim \{0,1\}^n$ and outputting $E_k(x')$.


21.4.1 Example: Perfect secrecy in the battlefield 

To understand Definition 21.3, suppose that Alice sends only one of
two possible messages: “attack” or “retreat”, which we denote by $x_0$
and $x_1$ respectively, and that she sends each one of those messages
with probability 1/2. Let us put ourselves in the shoes of Eve, the
eavesdropping adversary. A priori we would have guessed that Alice
sent either $x_0$ or $x_1$ with probability 1/2. Now we observe $y = E_k(x_i)$
where $k$ is a uniformly chosen key in $\{0,1\}^n$. How does this new
information cause us to update our beliefs on whether Alice sent the
plaintext $x_0$ or the plaintext $x_1$?


Figure 21.7: For any key length $n$, we can visualize an
encryption scheme $(E, D)$ as a graph with a vertex
for every one of the $2^{L(n)}$ possible plaintexts and for
every one of the ciphertexts in $\{0,1\}^*$ of the form
$E_k(x)$ for $k \in \{0,1\}^n$ and $x \in \{0,1\}^{L(n)}$. For every
plaintext $x$ and key $k$, we add an edge labeled $k$
between $x$ and $E_k(x)$. By the validity condition, if we
pick any fixed key $k$, the map $x \mapsto E_k(x)$ must be
one-to-one. The condition of perfect secrecy simply
corresponds to requiring that every two plaintexts
$x$ and $x'$ have exactly the same set of neighbors (or
multi-set, if there are parallel edges).


Let us define $p_0(y)$ to be the probability (taken over $k \sim \{0,1\}^n$)
that $y = E_k(x_0)$ and similarly $p_1(y)$ to be $\Pr_{k \sim \{0,1\}^n}[y = E_k(x_1)]$.
Note that, since Alice chooses the message to send at random, our
a priori probability for observing $y$ is $\tfrac{1}{2} p_0(y) + \tfrac{1}{2} p_1(y)$. However,
as per Definition 21.3, the perfect secrecy condition guarantees that
$p_0(y) = p_1(y)$! Let us denote the number $p_0(y) = p_1(y)$ by $p$. By the
formula for conditional probability, the probability that Alice sent the
message $x_0$ conditioned on our observation $y$ is simply

$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\Pr[i = 0 \wedge y = E_k(x_i)]}{\Pr[y = E_k(x_i)]} \qquad (21.2)$$


(The equation (21.2) is a special case of Bayes’ rule which, although 
a simple restatement of the formula for conditional probability, is 
an extremely important and widely used tool in statistics and data 
analysis.) 

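Since the probability that $i = 0$ and $y$ is the ciphertext $E_k(x_0)$ is equal
to $\tfrac{1}{2} \cdot p_0(y)$, and the a priori probability of observing $y$ is
$\tfrac{1}{2} p_0(y) + \tfrac{1}{2} p_1(y)$, we can rewrite (21.2) as

$$\Pr[i = 0 \mid y = E_k(x_i)] = \frac{\tfrac{1}{2} p_0(y)}{\tfrac{1}{2} p_0(y) + \tfrac{1}{2} p_1(y)} = \frac{p}{p + p} = \frac{1}{2}$$

using the fact that $p_0(y) = p_1(y) = p$. This means that observing the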
ciphertext y did not help us at all! We still would not be able to guess 
whether Alice sent “attack” or “retreat” with better than 50/50 odds! 
This example can be vastly generalized to show that perfect secrecy 
is indeed “perfect” in the sense that observing a ciphertext gives Eve 
no additional information about the plaintext beyond her a priori knowl- 
edge. 
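
The calculation above can be replayed numerically on a toy example. The following is a minimal sketch, not part of the original text: the helper names (`posterior_x0`, `xor_scheme`, `leaky_scheme`) and the 3-bit encodings of "attack" and "retreat" are assumptions made for illustration. It computes Eve's posterior exactly by enumerating all keys, once for a scheme that XORs the plaintext with a uniform key (the scheme constructed in Section 21.4.2) and once for a hypothetical scheme that leaks the first plaintext bit.

```python
from fractions import Fraction
from itertools import product

def posterior_x0(encrypt, key_bits, x0, x1, y):
    """Pr[Alice sent x0 | Eve saw y], for a uniform key and a 50/50 prior on x0, x1."""
    keys = list(product((0, 1), repeat=key_bits))
    # p_b(y) = probability over a uniform key k that E_k(x_b) equals y
    p0 = Fraction(sum(encrypt(k, x0) == y for k in keys), len(keys))
    p1 = Fraction(sum(encrypt(k, x1) == y for k in keys), len(keys))
    if p0 + p1 == 0:
        raise ValueError("y is never observed for these two plaintexts")
    # Bayes' rule, exactly as in the rewritten form of (21.2)
    return (p0 / 2) / (p0 / 2 + p1 / 2)

def xor_scheme(k, x):
    """XOR every plaintext bit with the corresponding key bit."""
    return tuple(a ^ b for a, b in zip(x, k))

def leaky_scheme(k, x):
    """Hypothetical bad scheme: sends the first plaintext bit in the clear."""
    return (x[0],) + tuple(a ^ b for a, b in zip(x[1:], k[1:]))

attack, retreat = (0, 1, 1), (1, 0, 0)   # assumed 3-bit encodings of the two messages

# With the XOR scheme every observation leaves Eve at exactly 1/2.
for y in product((0, 1), repeat=3):
    assert posterior_x0(xor_scheme, 3, attack, retreat, y) == Fraction(1, 2)

# With the leaky scheme a ciphertext starting with 0 can only come from "attack".
print(posterior_x0(leaky_scheme, 3, attack, retreat, (0, 0, 0)))
```

Exact fractions are used instead of floating point so that the equality with 1/2 is checked exactly rather than up to rounding.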


21.4.2 Constructing perfectly secret encryption 
Perfect secrecy is an extremely strong condition, and implies that an 
eavesdropper does not learn any information from observing the ci- 
phertext. You might think that an encryption scheme satisfying such a 
strong condition will be impossible, or at least extremely complicated, 
to achieve. However it turns out we can in fact obtain a perfectly secret 
encryption scheme fairly easily. Such a scheme for two-bit messages is 
illustrated in Fig. 21.8. 

Figure 21.8: A perfectly secret encryption scheme
for two-bit keys and messages. The blue vertices
represent plaintexts and the red vertices represent
ciphertexts, each edge mapping a plaintext $x$ to a
ciphertext $y = E_k(x)$ is labeled with the corresponding
key $k$. Since there are four possible keys, the degree of
the graph is four and it is in fact a complete bipartite
graph. The encryption scheme is valid in the sense
that for every $k \in \{0,1\}^2$, the map $x \mapsto E_k(x)$ is
one-to-one, which in other words means that the set
of edges labeled with $k$ is a matching.

In fact, this can be generalized to any number of bits:

Theorem 21.4 — One Time Pad (Vernam 1917, Shannon 1949). There is a
perfectly secret valid encryption scheme $(E, D)$ with $L(n) = C(n) = n$.


Proof Idea: 

Our scheme is the one-time pad also known as the “Vernam Cipher”,
see Fig. 21.9. The encryption is exceedingly simple: to encrypt
a message $x \in \{0,1\}^n$ with a key $k \in \{0,1\}^n$ we simply output $x \oplus k$
where $\oplus$ is the bitwise XOR operation that outputs the string
corresponding to XORing each coordinate of $x$ and $k$.

* 


Proof of Theorem 21.4. For two binary strings $a$ and $b$ of the same
length $n$, we define $a \oplus b$ to be the string $c \in \{0,1\}^n$ such that
$c_i = a_i + b_i \mod 2$ for every $i \in [n]$. The encryption scheme
$(E, D)$ is defined as follows: $E_k(x) = x \oplus k$ and $D_k(y) = y \oplus k$.
By the associative law of addition (which works also modulo two),
$D_k(E_k(x)) = (x \oplus k) \oplus k = x \oplus (k \oplus k) = x \oplus 0^n = x$, using the fact
that for every bit $\sigma \in \{0,1\}$, $\sigma + \sigma \mod 2 = 0$ and $\sigma + 0 = \sigma \mod 2$.
Hence $(E, D)$ form a valid encryption.

To analyze the perfect secrecy property, we claim that for every
$x \in \{0,1\}^n$, the distribution $Y_x = E_k(x)$ where $k \sim \{0,1\}^n$ is simply
the uniform distribution over $\{0,1\}^n$, and hence in particular the
distributions $Y_x$ and $Y_{x'}$ are identical for every $x, x' \in \{0,1\}^n$. Indeed,
for every particular $y \in \{0,1\}^n$, the value $y$ is output by $Y_x$ if and
only if $y = x \oplus k$ which holds if and only if $k = x \oplus y$. Since $k$ is
chosen uniformly at random in $\{0,1\}^n$, the probability that $k$ happens
to equal $x \oplus y$ is exactly $2^{-n}$, which means that every string $y$ is output
by $Y_x$ with probability $2^{-n}$.

■


Figure 21.9: In the one time pad encryption scheme we
encrypt a plaintext $x \in \{0,1\}^n$ with a key $k \in \{0,1\}^n$
by the ciphertext $x \oplus k$ where $\oplus$ denotes the bitwise
XOR operation.
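
To make the construction concrete, here is a minimal sketch of the one-time pad on byte strings, together with a brute-force check of Definition 21.3 on 3-bit strings. The function names (`otp_encrypt`, `is_perfectly_secret`, and so on) are illustrative assumptions rather than anything defined in the text, and the checker enumerates every key and plaintext, so it is only usable for very small $n$.

```python
import secrets
from collections import Counter
from itertools import product

def otp_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """E_k(x) = x XOR k; the key must be exactly as long as the message."""
    if len(key) != len(plaintext):
        raise ValueError("one-time pad needs a key as long as the message")
    return bytes(p ^ k for p, k in zip(plaintext, key))

# D_k(y) = y XOR k is the same operation, since (x XOR k) XOR k = x.
otp_decrypt = otp_encrypt

def ciphertext_distribution(encrypt, n, x):
    """Count how many of the 2^n keys send plaintext x to each ciphertext."""
    return Counter(encrypt(k, x) for k in product((0, 1), repeat=n))

def is_perfectly_secret(encrypt, n, L):
    """Brute-force check of Definition 21.3 for key length n, plaintext length L."""
    plaintexts = list(product((0, 1), repeat=L))
    reference = ciphertext_distribution(encrypt, n, plaintexts[0])
    return all(ciphertext_distribution(encrypt, n, x) == reference
               for x in plaintexts)

def xor_bits(k, x):
    """The one-time pad on tuples of bits, for use with the checker."""
    return tuple(a ^ b for a, b in zip(x, k))

key = secrets.token_bytes(6)              # fresh uniform key, used once
ciphertext = otp_encrypt(key, b"attack")
assert otp_decrypt(key, ciphertext) == b"attack"

print(is_perfectly_secret(xor_bits, 3, 3))          # True
print(is_perfectly_secret(lambda k, x: x, 2, 2))    # False: sends x in the clear
```

Comparing the full multisets of ciphertexts (via `Counter`) rather than just the sets mirrors the "multi-set, if there are parallel edges" remark in the caption of Figure 21.7.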


21.5 NECESSITY OF LONG KEYS 


So, does Theorem 21.4 give the final word on cryptography, and does it
mean that we can all communicate with perfect secrecy and live
happily ever after? No, it doesn’t. While the one-time pad is efficient,
and gives perfect secrecy, it has one glaring disadvantage: to communicate
$n$ bits you need to store a key of length $n$. In contrast, practically
used cryptosystems such as AES-128 have a short key of 128 bits (i.e.,
16 bytes) that can be used to protect terabytes or more of communication!
Imagine that we all needed to use the one time pad. If that was
the case, then if you had to communicate with m people, you would 
have to maintain (securely!) m huge files that are each as long as the 
length of the maximum total communication you expect with that per- 
son. Imagine that every time you opened an account with Amazon, 
Google, or any other service, they would need to send you in the mail 
(ideally with a secure courier) a DVD full of random numbers, and 
every time you suspected a virus, you'd need to ask all these services 
for a fresh DVD. This doesn’t sound so appealing. 

This is not just a theoretical issue. The Soviets used the one-time
pad for their confidential communication since before the 1940s.
In fact, even before Shannon’s work, the U.S. intelligence already 
knew in 1941 that the one-time pad is in principle “unbreakable” (see 
page 32 in the Venona document). However, it turned out that the 
hassle of manufacturing so many keys for all the communication took 
its toll on the Soviets and they ended up reusing the same keys for 
more than one message. They did try to use them for completely dif- 
ferent receivers in the (false) hope that this wouldn’t be detected. The 
Venona Project of the U.S. Army was founded in February 1943 by
Gene Grabeel (see Fig. 21.10), a former home economics teacher from
Madison Heights, Virginia, and Lt. Leonard Zubko. In October 1943,
they had their breakthrough when it was discovered that the Russians
were reusing their keys. In the 37 years of its existence, the project
resulted in a treasure chest of intelligence, exposing hundreds of KGB
agents and Russian spies in the U.S. and other countries, including 
Julius Rosenberg, Harry Gold, Klaus Fuchs, Alger Hiss, Harry Dexter 
White and many others. 

Figure 21.10: Gene Grabeel, who founded the U.S.
Russian SigInt program on 1 Feb 1943. Photo taken in
1942, see Page 7 in the Venona historical study.

Figure 21.11: An encryption scheme where the number
of keys is smaller than the number of plaintexts
corresponds to a bipartite graph where the degree is
smaller than the number of vertices on the left side.
Together with the validity condition this implies that
there will be two left vertices $x, x'$ with non-identical
neighborhoods, and hence the scheme does not satisfy
perfect secrecy.

Unfortunately it turns out that such long keys are necessary for
perfect secrecy:

Theorem 21.5 — Perfect secrecy requires long keys. For every perfectly
secret encryption scheme $(E, D)$ the length function $L$ satisfies
$L(n) \leq n$.


Proof Idea: 

The idea behind the proof is illustrated in Fig. 21.11. We define a 
graph between the plaintexts and ciphertexts, where we put an edge 
between plaintext $x$ and ciphertext $y$ if there is some key $k$ such that
$y = E_k(x)$. The degree of this graph is at most the number of potential
keys. The fact that the degree is smaller than the number of plaintexts 
(and hence of ciphertexts) implies that there would be two plaintexts 
x and x’ with different sets of neighbors, and hence the distribution 
of a ciphertext corresponding to x (with a random key) will not be 
identical to the distribution of a ciphertext corresponding to x’. 

* 


Proof of Theorem 21.5. Let $E, D$ be a valid encryption scheme with
messages of length $L$ and key of length $n < L$. We will show that
$(E, D)$ is not perfectly secret by providing two plaintexts $x_0, x_1 \in
\{0,1\}^L$ such that the distributions $Y_{x_0}$ and $Y_{x_1}$ are not identical, where
$Y_x$ is the distribution obtained by picking $k \sim \{0,1\}^n$ and outputting
$E_k(x)$.

We choose $x_0 = 0^L$. Let $S_0 \subseteq \{0,1\}^*$ be the set of all ciphertexts
that have non-zero probability of being output in $Y_{x_0}$. That is,
$S_0 = \{ y \mid \exists_{k \in \{0,1\}^n}\, y = E_k(x_0) \}$. Since there are only $2^n$ keys, we know that
$|S_0| \leq 2^n$.

We will show the following claim:

Claim I: There exists some $x_1 \in \{0,1\}^L$ and $k \in \{0,1\}^n$ such that
$E_k(x_1) \notin S_0$.

Claim I implies that the string $E_k(x_1)$ has positive probability of
being output by $Y_{x_1}$ and zero probability of being output by $Y_{x_0}$, and
hence in particular $Y_{x_0}$ and $Y_{x_1}$ are not identical. To prove Claim I, just
choose a fixed $k \in \{0,1\}^n$. By the validity condition, the map $x \mapsto
E_k(x)$ is a one to one map of $\{0,1\}^L$ to $\{0,1\}^*$ and hence in particular
the image of this map, which is the set $I_k = \{ y \mid \exists_{x \in \{0,1\}^L}\, y = E_k(x) \}$,
has size at least (in fact exactly) $2^L$. Since $|S_0| \leq 2^n < 2^L$, this means
that $|I_k| > |S_0|$ and so in particular there exists some string $y$ in $I_k \setminus S_0$.
But by the definition of $I_k$ this means that there is some $x \in \{0,1\}^L$
such that $E_k(x) \notin S_0$, which concludes the proof of Claim I and hence
of Theorem 21.5.

■

21.6 COMPUTATIONAL SECRECY


To sum up the previous episodes, we now know that: 


• It is possible to obtain a perfectly secret encryption scheme with key
length the same as the plaintext.


and


• It is not possible to obtain such a scheme with a key that is even a
single bit shorter than the plaintext.
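
The second point can be watched in action on a tiny example. The sketch below follows the steps of the proof of Theorem 21.5: it builds the set $S_0$ of encryptions of $x_0 = 0^L$, fixes one key, and searches for a plaintext $x_1$ whose encryption under that key falls outside $S_0$. The toy scheme `repeated_key_xor` (XOR with the key repeated cyclically) and the function names are assumptions chosen for illustration, not a scheme from the text.

```python
from itertools import product

def find_distinguishing_pair(encrypt, n, L):
    """Return (x1, k, y) with y = E_k(x1) not in S0, or None if no such triple is found."""
    keys = list(product((0, 1), repeat=n))
    x0 = (0,) * L
    s0 = {encrypt(k, x0) for k in keys}     # all ciphertexts reachable from x0
    k = keys[0]                             # the proof fixes an arbitrary key
    for x1 in product((0, 1), repeat=L):
        y = encrypt(k, x1)
        if y not in s0:                     # y has probability 0 under Y_{x0}
            return x1, k, y
    return None

def repeated_key_xor(k, x):
    """Hypothetical toy scheme: XOR the plaintext with the key repeated cyclically."""
    return tuple(b ^ k[i % len(k)] for i, b in enumerate(x))

# Key length 2, plaintext length 3: one bit too short, so a witness must exist.
print(find_distinguishing_pair(repeated_key_xor, 2, 3))

# Key length 3, plaintext length 3: this is the one-time pad, and no witness exists.
print(find_distinguishing_pair(repeated_key_xor, 3, 3))
```

With $n = 2$ and $L = 3$ the set $S_0$ has only four elements while a fixed key reaches eight distinct ciphertexts, which is exactly the counting gap $|I_k| > |S_0|$ used in the proof.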


How does this mesh with the fact that, as we’ve already seen, peo- 
ple routinely use cryptosystems with a 16 byte (i.e., 128 bit) key but 
many terabytes of plaintext? The proof of Theorem 21.5 does give in 
fact a way to break all these cryptosystems, but an examination of this 
proof shows that it only yields an algorithm with time exponential in 
the length of the key. This motivates the following relaxation of perfect 
secrecy to a condition known as “computational secrecy”. Intuitively, 
an encryption scheme is computationally secret if no polynomial time 
algorithm can break it. The formal definition is below: 


Definition 21.6 — Computational secrecy. Let $(E, D)$ be a valid encryption
scheme where for keys of length $n$, the plaintexts are of length
$L(n)$ and the ciphertexts are of length $m(n)$. We say that $(E, D)$ is
computationally secret if for every polynomial $p : \mathbb{N} \rightarrow \mathbb{N}$, and large
enough $n$, if $P$ is an $m(n)$-input and single output NAND-CIRC
program of at most $p(n)$ lines, and $x_0, x_1 \in \{0,1\}^{L(n)}$ then

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[P(E_k(x_0))] - \mathbb{E}_{k \sim \{0,1\}^n}[P(E_k(x_1))] \right| < \frac{1}{p(n)} \qquad (21.3)$$

Definition 21.6 raises two natural questions:


• Is it strong enough to ensure that a computationally secret encryption
scheme protects the secrecy of messages that are encrypted
with it?


• Is it weak enough that, unlike perfect secrecy, it is possible to obtain
a computationally secret encryption scheme where the key is much
smaller than the message?


To the best of our knowledge, the answer to both questions is Yes.
This is just one example of a much broader phenomenon. We can
use computational hardness to achieve many cryptographic goals,
including some goals that have been dreamed about for millennia, and
other goals that people have not even dared to imagine.


Regarding the first question, it is not hard to show that if, for ex- 
ample, Alice uses a computationally secret encryption algorithm to 
encrypt either “attack” or “retreat” (each chosen with probability 
1/2), then as long as she’s restricted to polynomial-time algorithms, an 
adversary Eve will not be able to guess the message with probability 
better than, say, 0.51, even after observing its encrypted form. (We 
omit the proof, but it is an excellent exercise for you to work it out on 
your own.) 

To answer the second question we will show that under the same 
assumption we used for derandomizing BPP, we can obtain a com- 
putationally secret cryptosystem where the key is almost exponentially 
smaller than the plaintext. 


21.6.1 Stream ciphers or the “derandomized one-time pad” 
It turns out that if pseudorandom generators exist as in the optimal 
PRG conjecture, then there exists a computationally secret encryption 
scheme with keys that are much shorter than the plaintext. The con- 
struction below is known as a stream cipher, though perhaps a better 
name is the “derandomized one-time pad”. It is widely used in prac- 
tice with keys on the order of a few tens or hundreds of bits protecting 
many terabytes or even petabytes of communication. 

We start by recalling the notion of a pseudorandom generator, as de- 
fined in Definition 20.9. For this chapter, we will fix a special case of 
the definition: 


Figure 21.12: In a stream cipher or “derandomized
one-time pad” we use a pseudorandom generator
$G : \{0,1\}^n \rightarrow \{0,1\}^L$ to obtain an encryption scheme
with a key length of $n$ and plaintexts of length $L$.
We encrypt the plaintext $x \in \{0,1\}^L$ with key
$k \in \{0,1\}^n$ by the ciphertext $x \oplus G(k)$.

Definition 21.7 — Cryptographic pseudorandom generator. Let $L : \mathbb{N} \rightarrow \mathbb{N}$ be
some function. A cryptographic pseudorandom generator with stretch
$L(\cdot)$ is a polynomial-time computable function $G : \{0,1\}^* \rightarrow \{0,1\}^*$
such that:


• For every $n \in \mathbb{N}$ and $s \in \{0,1\}^n$, $|G(s)| = L(n)$.


• For every polynomial $p : \mathbb{N} \rightarrow \mathbb{N}$ and $n$ large enough, if $C$ is a
circuit of $L(n)$ inputs, one output, and at most $p(n)$ gates then

$$\left| \Pr_{s \sim \{0,1\}^n}[C(G(s)) = 1] - \Pr_{r \sim \{0,1\}^{L(n)}}[C(r) = 1] \right| < \frac{1}{p(n)}$$

In this chapter we will call a cryptographic pseudorandom generator
simply a pseudorandom generator or PRG for short. The optimal
PRG conjecture of Section 20.4.2 implies that there is a pseudorandom
generator that can “fool” circuits of exponential size and where
the gap in probabilities is at most one over an exponential quantity.
Since exponentials grow faster than every polynomial, the optimal PRG
conjecture implies the following:


The crypto PRG conjecture: For every $a \in \mathbb{N}$, there is a cryptographic
pseudorandom generator with $L(n) = n^a$.


The crypto PRG conjecture is a weaker conjecture than the optimal
PRG conjecture, but it too (as we will see) is still stronger than the
conjecture that $P \neq NP$.


Theorem 21.8 — Derandomized one-time pad. Suppose that the crypto
PRG conjecture is true. Then for every constant $a \in \mathbb{N}$ there is a
computationally secret encryption scheme $(E, D)$ with plaintext
length $L(n)$ at least $n^a$.


Proof Idea: 

The proof is illustrated in Fig. 21.12. We simply take the one-time
pad on $L$ bit plaintexts, but replace the key with $G(k)$ where $k$ is a
string in $\{0,1\}^n$ and $G : \{0,1\}^n \rightarrow \{0,1\}^L$ is a pseudorandom
generator. Since the one time pad cannot be broken, an adversary that
breaks the derandomized one-time pad can be used to distinguish
between the output of the pseudorandom generator and the uniform
distribution.

*


Proof of Theorem 21.8. Let $G : \{0,1\}^n \rightarrow \{0,1\}^L$ for $L = n^a$ be the
restriction to input length $n$ of the pseudorandom generator $G$ whose
existence we are guaranteed from the crypto PRG conjecture. We
now define our encryption scheme as follows: given key $k \in \{0,1\}^n$
and plaintext $x \in \{0,1\}^L$, the encryption $E_k(x)$ is simply $x \oplus G(k)$.
To decrypt a string $y \in \{0,1\}^L$ we output $y \oplus G(k)$. This is a valid
encryption since $G$ is computable in polynomial time and
$(x \oplus G(k)) \oplus G(k) = x \oplus (G(k) \oplus G(k)) = x$ for every $x \in \{0,1\}^L$.

Computational secrecy follows from the condition of a pseudorandom
generator. Suppose, towards a contradiction, that there is
a polynomial $p$, a NAND-CIRC program $Q$ of at most $p(L)$ lines and
$x, x' \in \{0,1\}^{L(n)}$ such that

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(E_k(x))] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(E_k(x'))] \right| > \frac{1}{p(L)}$$

(We use here the simple fact that for a $\{0,1\}$-valued random variable
$X$, $\Pr[X = 1] = \mathbb{E}[X]$.)

By the definition of our encryption scheme, this means that

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)} \qquad (21.4)$$

Now (as we saw in the security analysis of the one-time pad),
for every two strings $x, x' \in \{0,1\}^L$, the distributions $r \oplus x$ and $r \oplus x'$ are
identical, where $r \sim \{0,1\}^L$. Hence

$$\mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] = \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] \qquad (21.5)$$

By plugging (21.5) into (21.4) we can derive that

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] + \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)} \qquad (21.6)$$

(Please make sure that you can see why this is true.)

Now we can use the triangle inequality that $|A + B| \leq |A| + |B|$ for
every two numbers $A, B$, applying it for
$A = \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)]$ and
$B = \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')]$
to derive

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] \right| + \left| \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x')] - \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x')] \right| > \frac{1}{p(L)} \qquad (21.7)$$

In particular, either the first term or the second term of the left-hand
side of (21.7) must be larger than $\frac{1}{2p(L)}$. Let us assume the first case
holds (the second case is analyzed in exactly the same way). Then we
get that

$$\left| \mathbb{E}_{k \sim \{0,1\}^n}[Q(G(k) \oplus x)] - \mathbb{E}_{r \sim \{0,1\}^L}[Q(r \oplus x)] \right| > \frac{1}{2p(L)} \qquad (21.8)$$

But if we now define the NAND-CIRC program $P_x$ that on input
$r \in \{0,1\}^L$ outputs $Q(r \oplus x)$ then (since XOR of $L$ bits can be computed
in $O(L)$ lines), we get that $P_x$ has $p(L) + O(L)$ lines and by (21.8) it
can distinguish between an input of the form $G(k)$ and an input of the
form $r \sim \{0,1\}^L$ with advantage better than $\frac{1}{2p(L)}$. Since a polynomial
is dominated by an exponential, if we make $L$ large enough, this will
contradict the $(2^{\delta n}, 2^{-\delta n})$ security of the pseudorandom generator $G$.

■
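
The scheme $E_k(x) = x \oplus G(k)$ is short enough to write down in full. The sketch below is only an illustration under a loudly stated assumption: it uses SHAKE-256 from Python's standard `hashlib` as a stand-in for the generator $G$. Nothing in the text says SHAKE-256 satisfies Definition 21.7, and no function is proven to; the point is only to show the shape of the construction, with a 16-byte key protecting a much longer plaintext.

```python
import hashlib
import secrets

def prg(seed: bytes, out_len: int) -> bytes:
    """Stand-in for G (an assumption): stretch a short seed to out_len bytes."""
    return hashlib.shake_256(seed).digest(out_len)

def stream_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """E_k(x) = x XOR G(k), where G(k) is stretched to the plaintext length."""
    pad = prg(key, len(plaintext))
    return bytes(p ^ g for p, g in zip(plaintext, pad))

# D_k(y) = y XOR G(k) is the same operation.
stream_decrypt = stream_encrypt

key = secrets.token_bytes(16)                # 128-bit key
message = b"attack at dawn " * 1000          # 15,000 bytes of plaintext
ciphertext = stream_encrypt(key, message)

assert len(ciphertext) == len(message)
assert stream_decrypt(key, ciphertext) == message
```

As with the one-time pad itself, the pad $G(k)$ must never be reused for a second message: XORing two ciphertexts produced with the same key cancels the pad and reveals the XOR of the two plaintexts, which is the kind of key reuse that the Venona project exploited.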


21.7 COMPUTATIONAL SECRECY AND NP 


We’ve also mentioned before that an efficient algorithm for NP could 
be used to break all cryptography. We now give an example of how 
this can be done: 


Theorem 21.10 — Breaking encryption using NP algorithm. If $P = NP$
then there is no computationally secret encryption scheme with
$L(n) > n$.

Furthermore, for every valid encryption scheme $(E, D)$ with
$L(n) > n + 100$ there is a polynomial $p$ such that for every large
enough $n$ there exist $x_0, x_1 \in \{0,1\}^{L(n)}$ and a $p(n)$-line NAND-CIRC
program EVE s.t.

$$\Pr_{i \sim \{0,1\},\, k \sim \{0,1\}^n}[EVE(E_k(x_i)) = i] \geq 0.99$$

Note that the “furthermore” part is extremely strong. It means
that if the plaintext is even a little bit larger than the key, then we can
already break the scheme in a very strong way. That is, there will be
a pair of messages $x_0$, $x_1$ (think of $x_0$ as “sell” and $x_1$ as “buy”) and
an efficient strategy for Eve such that if Eve gets a ciphertext $y$ then
she will be able to tell whether $y$ is an encryption of $x_0$ or $x_1$ with
probability very close to 1. (We model breaking the scheme as Eve
outputting 0 or 1 corresponding to whether the message sent was $x_0$
or $x_1$. Note that we could have just as well modified Eve to output $x_0$
instead of 0 and $x_1$ instead of 1. The key point is that a priori Eve only
had a 50/50 chance of guessing whether Alice sent $x_0$ or $x_1$ but after
seeing the ciphertext this chance increases to better than 99/1.) The
condition $P = NP$ can be relaxed to $NP \subseteq BPP$ and even the weaker
condition $NP \subseteq P_{/poly}$ with essentially the same proof.


Proof Idea: 

The proof follows along the lines of Theorem 21.5 but this time
paying attention to the computational aspects. If $P = NP$ then for
every plaintext $x$ and ciphertext $y$, we can efficiently tell whether there
exists $k \in \{0,1\}^n$ such that $E_k(x) = y$. So, to prove this result we need
to show that if the plaintexts are long enough, there would exist a pair
$x_0, x_1$ such that the probability that a random encryption of $x_1$ also is
a valid encryption of $x_0$ will be very small. The details of how to show
this are below.

* 


Proof of Theorem 21.10. We focus on showing only the “furthermore”
part since it is the more interesting and the other part follows by
essentially the same proof.

Suppose that $(E, D)$ is such an encryption, let $n$ be large enough,
and let $x_0 = 0^{L(n)}$. For every $x \in \{0,1\}^{L(n)}$ we define $S_x$ to be the set
of all valid encryptions of $x$. That is, $S_x = \{ y \mid \exists_{k \in \{0,1\}^n}\, y = E_k(x) \}$. As
in the proof of Theorem 21.5, since there are $2^n$ keys $k$, $|S_x| \leq 2^n$ for
every $x \in \{0,1\}^{L(n)}$.

We denote by $S_0$ the set $S_{x_0}$. We define our algorithm EVE to output
0 on input $y \in \{0,1\}^*$ if $y \in S_0$ and to output 1 otherwise. This
can be implemented in polynomial time if $P = NP$, since the key $k$
can serve the role of an efficiently verifiable solution. (Can you see
why?) Clearly $\Pr[EVE(E_k(x_0)) = 0] = 1$ and so in the case that EVE
gets an encryption of $x_0$ then she guesses correctly with probability
1. The remainder of the proof is devoted to showing that there exists
$x_1 \in \{0,1\}^{L(n)}$ such that $\Pr[EVE(E_k(x_1)) = 0] \leq 0.01$, which
will conclude the proof by showing that EVE guesses wrongly with
probability at most $\tfrac{1}{2} \cdot 0 + \tfrac{1}{2} \cdot 0.01 < 0.01$.

Consider now the following probabilistic experiment (which we
define solely for the sake of analysis). We consider the sample space
of choosing $x$ uniformly in $\{0,1\}^{L(n)}$ and define the random variable
$Z_k(x)$ to equal 1 if and only if $E_k(x) \in S_0$. For every $k$, the map $x \mapsto
E_k(x)$ is one-to-one, which means that the probability that $Z_k = 1$
is equal to the probability that $x \in E_k^{-1}(S_0)$, which is $\frac{|S_0|}{2^{L(n)}}$. So by the
linearity of expectation $\mathbb{E}\left[\sum_{k \in \{0,1\}^n} Z_k\right] \leq \frac{2^n |S_0|}{2^{L(n)}}$.

We will now use the following extremely simple but useful fact
known as the averaging principle (see also Lemma 18.10): for every
random variable $Z$, if $\mathbb{E}[Z] = \mu$, then with positive probability $Z \leq \mu$.
(Indeed, if $Z > \mu$ with probability one, then the expected value of $Z$
will have to be larger than $\mu$, just like you can’t have a class in which
all students got A or A- and yet the overall average is B+.) In our case
it means that with positive probability $\sum_{k \in \{0,1\}^n} Z_k \leq \frac{2^n |S_0|}{2^{L(n)}}$. In other
words, there exists some $x_1 \in \{0,1\}^{L(n)}$ such that
$\sum_{k \in \{0,1\}^n} Z_k(x_1) \leq \frac{2^n |S_0|}{2^{L(n)}}$. Yet this means that if we choose a random $k \sim \{0,1\}^n$, then
the probability that $E_k(x_1) \in S_0$ is at most
$\frac{1}{2^n} \cdot \frac{2^n |S_0|}{2^{L(n)}} \leq \frac{2^n}{2^{L(n)}} = 2^{n - L(n)}$.

So, in particular if we have an algorithm EVE that on input $y$ outputs 0 if
$y \in S_0$ and outputs 1 otherwise, then $\Pr[EVE(E_k(x_0)) = 0] = 1$ and
$\Pr[EVE(E_k(x_1)) = 0] \leq 2^{n - L(n)}$, which will be smaller than $2^{-10} < 0.01$
if $L(n) > n + 10$.

■
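
The attack in this proof can be run directly on a toy scheme, with exhaustive search over keys playing the role that the hypothetical polynomial-time algorithm would play if $P = NP$. The following sketch is illustrative only: `make_eve`, `best_x1` and the toy scheme `repeated_key_xor` are assumed names, and the search takes time exponential in $n$ and $L$.

```python
from itertools import product

def repeated_key_xor(k, x):
    """Hypothetical toy scheme: XOR the plaintext with the key repeated cyclically."""
    return tuple(b ^ k[i % len(k)] for i, b in enumerate(x))

def make_eve(encrypt, n, L):
    """Eve for x0 = 0^L: output 0 if y is an encryption of x0 under some key, else 1."""
    keys = list(product((0, 1), repeat=n))
    x0 = (0,) * L
    s0 = {encrypt(k, x0) for k in keys}   # brute force stands in for the P = NP algorithm
    return lambda y: 0 if y in s0 else 1

def best_x1(encrypt, n, L, eve):
    """Search for the x1 minimizing Pr_k[EVE(E_k(x1)) = 0], returning it with that probability."""
    keys = list(product((0, 1), repeat=n))
    def error(x):
        return sum(eve(encrypt(k, x)) == 0 for k in keys) / len(keys)
    x1 = min(product((0, 1), repeat=L), key=error)
    return x1, error(x1)

n, L = 2, 4
eve = make_eve(repeated_key_xor, n, L)
x0 = (0,) * L
x1, err = best_x1(repeated_key_xor, n, L, eve)

# Eve is always right on encryptions of x0 ...
assert all(eve(repeated_key_xor(k, x0)) == 0 for k in product((0, 1), repeat=n))
# ... and on x1 she errs with probability at most 2^(n - L), as the proof promises.
assert err <= 2 ** (n - L)
print(x1, err)
```

The proof only shows that a good $x_1$ exists by averaging; the sketch finds one by trying every plaintext, which is affordable only because $L = 4$.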


In retrospect Theorem 21.10 is perhaps not surprising. After all, as
we've mentioned before it is known that the Optimal PRG conjecture
(which is the basis for the derandomized one-time pad encryption) is
false if $P = NP$ (and in fact even if $NP \subseteq BPP$ or even $NP \subseteq P_{/poly}$).


21.8 PUBLIC KEY CRYPTOGRAPHY 


People have been dreaming about heavier-than-air flight since at least
the days of Leonardo da Vinci (not to mention Icarus from Greek
mythology). Jules Verne wrote with rather insightful details about
going to the moon in 1865. But, as far as I know, in all the thousands
of years people have been using secret writing, until about 50 years
ago no one had considered the possibility of communicating securely
without first exchanging a shared secret key.

Yet in the late 1960’s and early 1970's, several people started to 
question this “common wisdom”. Perhaps the most surprising of 
these visionaries was an undergraduate student at Berkeley named 
Ralph Merkle. In the fall of 1974 Merkle wrote in a project proposal 
for his computer security course that while “it might seem intuitively
obvious that if two people have never had the opportunity to prearrange
an encryption method, then they will be unable to communicate
securely over an insecure channel... I believe it is false”. The project
proposal was rejected by his professor as “not good enough”. Merkle
later submitted a paper to the Communications of the ACM where he
apologized for the lack of references since he was unable to find any
mention of the problem in the scientific literature, and the only source
where he saw the problem even raised was in a science fiction story.
The paper was rejected with the comment that “Experience shows that
it is extremely dangerous to transmit key information in the clear.”
Merkle showed that one can design a protocol where Alice and Bob
can use $T$ invocations of a hash function to exchange a key, but an
adversary (in the random oracle model, though he of course didn’t
use this name) would need roughly $T^2$ invocations to break it. He
conjectured that it may be possible to obtain such protocols where
breaking is exponentially harder than using them, but could not think of
any concrete way of doing so.

We only found out much later that in the late 1960’s, a few years
before Merkle, James Ellis of the British intelligence agency GCHQ
was having similar thoughts. His curiosity was spurred by an old
World War II manuscript from Bell Labs that suggested the following
way that two people could communicate securely over a phone line.
Alice would inject noise to the line, Bob would relay his messages,
and then Alice would subtract the noise to get the signal. The idea is
that an adversary over the line sees only the sum of Alice’s and Bob’s
signals, and doesn’t know what came from what. This got James Ellis
thinking whether it would be possible to achieve something like that
digitally. As Ellis later recollected, in 1970 he realized that in principle
this should be possible, since he could think of a hypothetical
black box $B$ that on input a “handle” $\alpha$ and plaintext $x$ would give a
“ciphertext” $y$ and that there would be a secret key $\beta$ corresponding
to $\alpha$, such that feeding $\beta$ and $y$ to the box would recover $x$. However,
Ellis had no idea how to actually instantiate this box. He and others
kept giving this question as a puzzle to bright new recruits until one
of them, Clifford Cocks, came up in 1973 with a candidate solution
loosely based on the factoring problem; in 1974 another GCHQ recruit,
Malcolm Williamson, came up with a solution using modular
exponentiation.

But among all those thinking of public key cryptography, probably 
the people who saw the furthest were two researchers at Stanford, 
Whit Diffie and Martin Hellman. They realized that with the advent 
of electronic communication, cryptography would find new applica- 
tions beyond the military domain of spies and submarines, and they 
understood that in this new world of many users and point-to-point
communication, cryptography would need to scale up. Diffie and Hellman
envisioned an object which we now call a “trapdoor permutation”,
though they called it a “one way trapdoor function” or sometimes simply
“public key encryption”. Though they didn’t have full formal definitions,
their idea was that this is an injective function that is easy (e.g.,
polynomial-time) to compute but hard (e.g., exponential-time) to in- 
vert. However, there is a certain trapdoor, knowledge of which would 
allow polynomial time inversion. Diffie and Hellman argued that us- 
ing such a trapdoor function, it would be possible for Alice and Bob 
to communicate securely without ever having exchanged a secret key. But 
they didn’t stop there. They realized that protecting the integrity of 
communication is no less important than protecting its secrecy. Thus 
they imagined that Alice could “run encryption in reverse” in order to 
certify or sign messages. 

At this point, Diffie and Hellman were in a position not unlike
physicists who predicted that a certain particle should exist but with- 
out any experimental verification. Luckily they met Ralph Merkle, 
and his ideas about a probabilistic key exchange protocol, together with 
a suggestion from their Stanford colleague John Gill, inspired them 
to come up with what today is known as the Diffie Hellman Key Ex- 
change (which unbeknownst to them was found two years earlier at 
GCHQ by Malcolm Williamson). They published their paper “New 
Directions in Cryptography” in 1976, and it is considered to have 
brought about the birth of modern cryptography. 

The Diffie-Hellman Key Exchange is still widely used today for 
secure communication. However, it still fell short of providing Diffie
and Hellman’s elusive trapdoor function. This was done the next year 
by Rivest, Shamir and Adleman who came up with the RSA trapdoor 
function, which through the framework of Diffie and Hellman yielded 
not just encryption but also signatures. (A close variant of the RSA 
function was discovered earlier by Clifford Cocks at GCHQ, though 
as far as I can tell Cocks, Ellis and Williamson did not realize the 
application to digital signatures.) From this point on began a flurry of 
advances in cryptography which hasn’t died down till this day. 


21.8.1 Defining public key encryption 
A public key encryption scheme consists of a triple of algorithms:


• The key generation algorithm, which we denote by KeyGen or KG for
short, is a randomized algorithm that outputs a pair of strings $(e, d)$
where $e$ is known as the public (or encryption) key, and $d$ is known
as the private (or decryption) key. The key generation algorithm gets
as input $1^n$ (i.e., a string of ones of length $n$). We refer to $n$ as the
security parameter of the scheme. The bigger we make $n$, the more
secure the encryption will be, but also the less efficient it will be.


• The encryption algorithm, which we denote by $E$, takes the encryption
key $e$ and a plaintext $x$, and outputs the ciphertext $y = E_e(x)$.


• The decryption algorithm, which we denote by $D$, takes the decryption
key $d$ and a ciphertext $y$, and outputs the plaintext $x = D_d(y)$.


Figure 21.13: Top left: Ralph Merkle, Martin Hellman
and Whit Diffie, who together came up in 1976
with the concept of public key encryption and a key
exchange protocol. Bottom left: Adi Shamir, Ron Rivest,
and Leonard Adleman who, following Diffie and
Hellman’s paper, discovered the RSA function that
can be used for public key encryption and digital
signatures. Interestingly, one can see the equation
$P = NP$ on the blackboard behind them. Right: John
Gill, who was the first person to suggest to Diffie and
Hellman that they use modular exponentiation as an
easy-to-compute but hard-to-invert function.


We now make this a formal definition: 


Definition 21.11 — Public Key Encryption. A computationally secret public 
key encryption with plaintext length L : ℕ → ℕ is a triple of ran- 
domized polynomial-time algorithms (KG, E, D) that satisfy the 
following conditions: 


• For every n, if (e, d) is output by KG(1^n) with positive proba- 
bility, and x ∈ {0,1}^{L(n)}, then D_d(E_e(x)) = x with probability 
one. 


• For every polynomial p, and sufficiently large n, if P is a NAND- 
CIRC program of at most p(n) lines then for every x, x' ∈ 
{0,1}^{L(n)}, |E[P(e, E_e(x))] − E[P(e, E_e(x'))]| < 1/p(n), where 
this probability is taken over the coins of KG and E. 


Definition 21.11 allows E and D to be randomized algorithms. In 
fact, it turns out that it is necessary for E to be randomized to obtain 
computational secrecy. It also turns out that, unlike the private key 
case, we can transform a public-key encryption that works for mes- 
sages that are only one bit long into a public-key encryption scheme 
that can encrypt arbitrarily long messages, and in particular messages 
that are longer than the key. In particular this means that we cannot ob- 
tain a perfectly secret public-key encryption scheme even for one-bit 
long messages (since it would imply a perfectly secret public-key, and 
hence in particular private-key, encryption with messages longer than 
the key). 

We will not give full constructions for public key encryption 
schemes in this chapter, but will mention some of the ideas that 
underlie the most widely used schemes today. These generally belong 
to one of two families: 


• Group theoretic constructions based on problems such as integer factor- 
ing and the discrete logarithm over finite fields or elliptic curves. 


• Lattice/coding based constructions based on problems such as the 
closest vector in a lattice or bounded distance decoding. 


Group-theory based encryptions such as the RSA cryptosystem, the 
Diffie-Hellman protocol, and Elliptic-Curve Cryptography, are cur- 
rently more widely implemented. But the lattice/coding schemes are 
recently on the rise, particularly because the known group theoretic 
encryption schemes can be broken by quantum computers, which we 
discuss in Chapter 23. 

Figure 21.14: In a public key encryption, Alice generates 
a private/public keypair (e, d), publishes e and keeps 
d secret. To encrypt a message for Alice, one only 
needs to know e. To decrypt it we need to know d. 


21.8.2 Diffie-Hellman key exchange 

As just one example of how public key encryption schemes are con- 
structed, let us now describe the Diffie-Hellman key exchange. We 
describe the Diffie-Hellman protocol at a somewhat informal 
level, without presenting a full security analysis. 

The computational problem underlying the Diffie-Hellman protocol 
is the discrete logarithm problem. Let's suppose that g is some integer. 
We can compute the map x ↦ g^x and also its inverse y ↦ log_g y. (For 
example, we can compute a logarithm by binary search: start with 
some interval [x_min, x_max] that is guaranteed to contain log_g y. We can 
then test whether the interval’s midpoint x_mid satisfies g^{x_mid} > y, and 
based on that halve the size of the interval.) 
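
The binary search just described can be sketched in a few lines of Python. This is only an illustration over the ordinary integers (no modular reduction); the function name integer_log and the assumptions that g ≥ 2 and that y is an exact power of g are ours, not part of the text.

```python
def integer_log(g: int, y: int) -> int:
    """Return x such that g**x == y, assuming g >= 2 and y is an exact power of g."""
    lo, hi = 0, 1
    # Grow the interval until g**hi >= y, so that [lo, hi] contains log_g(y).
    while g ** hi < y:
        lo, hi = hi, 2 * hi
    # Compare g**mid with y and keep the half of the interval that contains the answer.
    while lo < hi:
        mid = (lo + hi) // 2
        if g ** mid < y:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

The search works only because x ↦ g^x is increasing over the integers, so a single comparison tells us which half of the interval to discard.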

However, suppose now that we use modular arithmetic and work 
modulo some prime number p. If p has n binary digits and g is in [p] 
then we can compute the map x ↦ g^x mod p in time polynomial in n. 
(This is not trivial, and is a great exercise for you to work this out; as a 
hint, start by showing that one can compute the map k ↦ g^{2^k} mod p 
using k modular multiplications modulo p; if you're stumped, you 
can look up this Wikipedia entry.) On the other hand, because of the 
“wraparound” property of modular arithmetic, we cannot run binary 
search to find the inverse of this map (known as the discrete logarithm). 
In fact, there is no known polynomial-time algorithm for computing 
this discrete logarithm map (g, x, p) ↦ log_g x mod p, where we define 
log_g x mod p as the number a ∈ [p] such that g^a = x mod p. 
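
For readers who have attempted the exercise, here is one possible sketch of the repeated-squaring idea from the hint. The function name mod_exp is our own; Python's built-in three-argument pow(g, x, p) computes the same value and is what one would use in practice.

```python
def mod_exp(g: int, x: int, p: int) -> int:
    """Compute g**x mod p with O(log x) modular multiplications."""
    result = 1 % p
    square = g % p  # after k iterations of the loop, square == g**(2**k) mod p
    while x > 0:
        if x & 1:  # the current binary digit of x is 1
            result = (result * square) % p
        square = (square * square) % p
        x >>= 1
    return result
```

Since x has at most n binary digits, the loop runs at most n times, and every intermediate number stays below p^2.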

The Diffie-Hellman protocol for Bob to send a message to Alice is as 
follows: 


• Alice: Chooses p to be a random n bit long prime (which can be 
done by choosing random numbers and running a primality testing 
algorithm on them), and g and a at random in [p]. She sends to Bob 
the triple (p, g, g^a mod p). 


• Bob: Given the triple (p, g, h), Bob sends a message x ∈ {0,1}^L 
to Alice by choosing b at random in [p], and sending to Alice the 
pair (g^b mod p, rep(h^b mod p) ⊕ x) where rep : [p] → {0,1}^L 
is some “representation function” that maps [p] to {0,1}^L. (The 
function rep does not need to be one-to-one and you can think of 
rep(z) as simply outputting L of the bits of z in the natural binary 
representation; it does need to satisfy certain technical conditions 
which we omit in this description.) 


• Alice: Given g', z, Alice recovers x by outputting rep(g'^a mod p) ⊕ z. 


The correctness of the protocol follows from the simple fact that 
(g^a)^b = (g^b)^a for every g, a, b and this still holds if we work modulo 
p. Its security relies on the computational assumption that computing 
this map is hard, even in a certain “average case” sense (this computa- 
tional assumption is known as the Decisional Diffie-Hellman assump- 
tion). The Diffie-Hellman key exchange protocol can be thought of as 
a public key encryption where Alice’s first message is the public key, 
and Bob’s message is the encryption. 
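
To make the three steps concrete, here is a toy Python sketch of the protocol. It is an illustration under simplifying assumptions that are ours and not the text's: the prime is the fixed Mersenne prime 2^61 − 1 rather than a random n-bit prime, exponents are drawn from 1, …, p − 1, and rep simply keeps the 32 low-order bits. None of these choices is meant to be secure.

```python
import secrets

P = 2**61 - 1  # fixed prime, for illustration only (the text uses a random n-bit prime)
L = 32         # message length in bits (illustrative choice)

def rep(z: int) -> int:
    """Toy representation function: the L low-order bits of z."""
    return z & ((1 << L) - 1)

def alice_keygen():
    """Alice's first message (p, g, g^a mod p) together with her secret exponent a."""
    g = 1 + secrets.randbelow(P - 1)
    a = 1 + secrets.randbelow(P - 1)
    return (P, g, pow(g, a, P)), a

def bob_encrypt(public, x: int):
    """Bob's message: the pair (g^b mod p, rep(h^b mod p) XOR x)."""
    p, g, h = public
    b = 1 + secrets.randbelow(p - 1)
    return pow(g, b, p), rep(pow(h, b, p)) ^ x

def alice_decrypt(public, a: int, ciphertext) -> int:
    """Alice recovers x as rep(g'^a mod p) XOR z."""
    p, _, _ = public
    g_prime, z = ciphertext
    return rep(pow(g_prime, a, p)) ^ z
```

Decryption succeeds because h^b = (g^a)^b and g'^a = (g^b)^a are the same element modulo p, so both parties apply rep to the same value.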

One can think of the Diffie-Hellman protocol as being based on 
a “trapdoor pseudorandom generator” where the triple g^a, g^b, g^{ab} 
looks “random” to someone that doesn’t know a, but someone that 
does know a can see that raising the second element to the a-th power 
yields the third element. The Diffie-Hellman protocol can be described 
abstractly in the context of any finite Abelian group for which we can 
efficiently compute the group operation. It has been implemented 
on other groups than numbers modulo p, and in particular Elliptic 
Curve Cryptography (ECC) is obtained by basing the Diffie-Hellman 
protocol on elliptic curve groups, which gives some practical advantages. 
Another common group theoretic basis for key-exchange/public key 
encryption protocols is the RSA function. A big disadvantage of Diffie- 
Hellman (both the modular arithmetic and elliptic curve variants) 
and RSA is that both schemes can be broken in polynomial time by a 
quantum computer. We will discuss quantum computing later in this 
course. 


21.9 OTHER SECURITY NOTIONS 


There is a great deal to cryptography beyond just encryption schemes, 
and beyond the notion of a passive adversary. A central objective 
is integrity or authentication: protecting communications from being 
modified by an adversary. Integrity is often more fundamental than 
secrecy: whether it is a software update or viewing the news, you 
might often not care about the communication being secret as much as 
that it indeed came from its claimed source. Digital signature schemes 
are the analog of public key encryption for authentication, and are 
widely used (in particular as the basis for public key certificates) to 
provide a foundation of trust in the digital world. 

Similarly, even for encryption, we often need to ensure security 
against active attacks, and so notions such as non-malleability and 
adaptive chosen ciphertext security have been proposed. An encryp- 
tion scheme is only as secure as the secret key, and mechanisms to 
make sure the key is generated properly, and is protected against refresh 
or even compromise (i.e., forward secrecy) have been studied as 
well. Hopefully this chapter provides you with some appreciation for 
cryptography as an intellectual field, but does not imbue you with a 
false self confidence in implementing it. 

Cryptographic hash functions are another widely used tool with a 
variety of uses, including extracting randomness from high entropy 
sources, achieving hard-to-forge short “digests” of files, protecting 
passwords, and much more. 


21.10 MAGIC 


Beyond encryption and signature schemes, cryptographers have man- 
aged to obtain objects that truly seem paradoxical and “magical”. We 
briefly discuss some of these objects. We do not give any details, but 
hopefully this will spark your curiosity to find out more. 


21.10.1 Zero knowledge proofs 

On October 31, 1903, the mathematician Frank Nelson Cole 
gave an hourlong lecture to a meeting of the American Mathe- 
matical Society where he did not speak a single word. Rather, 
he calculated on the board the value 2^67 − 1 which is equal to 
147,573,952,589,676,412,927, and then showed that this number is 
equal to 193,707,721 × 761,838,257,287. Cole’s proof showed that 
2^67 − 1 is not a prime, but it also revealed additional information, 
namely its actual factors. This is often the case with proofs: they teach 
us more than just the validity of the statements. 
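
Cole's hour of silent arithmetic can be replayed in a moment with Python's arbitrary-precision integers. The variable names below are our own.

```python
n = 2**67 - 1
p, q = 193_707_721, 761_838_257_287

# Checking the factorization is a single multiplication; finding p and q is the hard part.
assert p * q == n
print(f"{n:,}")
```

The gap between checking this proof (one multiplication) and discovering it is the same asymmetry that cryptography tries to exploit.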

In Zero Knowledge Proofs we try to achieve the opposite effect. We 
want a proof for a statement X where we can rigorously show that the 
proof reveals absolutely no additional information about X beyond the 
fact that it is true. This turns out to be an extremely useful object for 
a variety of tasks including authentication, secure protocols, voting, 
anonymity in cryptocurrencies, and more. Constructing these ob- 
jects relies on the theory of NP completeness. Thus this theory that 
originally was designed to give a negative result (show that some prob- 
lems are hard) ended up yielding positive applications, enabling us to 
achieve tasks that were not possible otherwise. 


21.10.2 Fully homomorphic encryption 

Suppose that we are given a bit-by-bit encryption of a string 
E_k(x_0), …, E_k(x_{n-1}). By design, these ciphertexts are supposed to 
be “completely inscrutable” and we should not be able to extract 
any information about the x_i’s from them. However, already in 1978, Rivest, 
Adleman and Dertouzos observed that this does not imply that we 
could not manipulate these encryptions. For example, it turns out the 
security of an encryption scheme does not immediately rule out the 
ability to take a pair of encryptions E_k(a) and E_k(b) and compute 
from them E_k(a NAND b) without knowing the secret key k. But do there 
exist encryption schemes that allow such manipulations? And if so, is 
this a bug or a feature? 

Rivest et al. already showed that such encryption schemes could 
be immensely useful, and their utility has only grown in the age of 
cloud computing. After all, if we can compute NAND then we can 
use this to run any algorithm P on the encrypted data, and map 
E_k(x_0), …, E_k(x_{n-1}) to E_k(P(x_0, …, x_{n-1})). For example, a client 
could store their secret data x in encrypted form on the cloud, and 
have the cloud provider perform all sorts of computation on these 
data without ever revealing to the provider the private key, and so 
without the provider ever learning any information about the secret 
data. 

The question of the existence of such a scheme took much longer 
to resolve. Only in 2009 did Craig Gentry give the first construction of 
an encryption scheme that allows one to compute a universal basis of 
gates on the data (known as a Fully Homomorphic Encryption scheme in 
crypto parlance). Gentry’s scheme left much to be desired in terms of 
efficiency, and improving upon it has been the focus of an intensive 
research program that has already seen significant improvements. 


21.10.3 Multiparty secure computation 

Cryptography is about enabling mutually distrusting parties to 
achieve a common goal. Perhaps the most general primitive achiev- 
ing this objective is secure multiparty computation. The idea in secure 
multiparty computation is that n parties interact together to compute 
some function F(x_0, …, x_{n-1}) where x_i is the private input of the i-th 
party. The crucial point is that there is no commonly trusted party or 
authority and that nothing is revealed about the secret data beyond the 
function’s output. One example is an electronic voting protocol where 
only the total vote count is revealed, with the privacy of the individual 
voters protected, but without having to trust any authority to either 
count the votes correctly or to keep information confidential. Another 
example is implementing a second price (aka Vickrey) auction where 
n − 1 parties submit bids for an item owned by the n-th party, and the 
item goes to the highest bidder but at the price of the second highest bid. 
Using secure multiparty computation we can implement a second price 
auction in a way that will ensure the secrecy of the numerical values 
of all bids (including even the top one) except the second highest one, 
and the secrecy of the identity of all bidders (including even the sec- 
ond highest bidder) except the top one. We emphasize that such a 
protocol requires no trust even in the auctioneer itself, who will also 
not learn any additional information. Secure multiparty computation 
can be used even for computing randomized processes, with one exam- 
ple being playing Poker over the net without having to trust any server 
for correct shuffling of cards or not revealing the information. 
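
To pin down what "nothing is revealed beyond the function's output" means in the auction example, it helps to write out the function F itself. The Python sketch below is only the ideal functionality, computed in the clear by a single trusted party; it is not a secure protocol. The name second_price_auction and the rule that ties go to the lowest index are our own assumptions.

```python
def second_price_auction(bids: list[int]) -> tuple[int, int]:
    """Return (index of the winning bidder, price paid) for at least two bids."""
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    # Indices sorted by bid, highest first; ties are broken in favor of the lowest index.
    order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    winner, runner_up = order[0], order[1]
    return winner, bids[runner_up]
```

For instance, on bids [30, 75, 50, 10] the output is (1, 50): the identity of the top bidder and the value of the second highest bid, which is exactly the information the text says a secure protocol may reveal.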


21.11 EXERCISES 
21.12 BIBLIOGRAPHICAL NOTES 


Much of this text is taken from my lecture notes on cryptography. 

Shannon’s manuscript was written in 1945 but was classified, and a 
partial version was only published in 1949. Still it has revolutionized 
cryptography, and is the forerunner to much of what followed. 

The Venona project’s history is described in this document. Aside 
from Grabeel and Zubko, credit for the discovery that the Soviets were 
reusing keys is shared by Lt. Richard Hallock, Carrie Berry, Frank 
Lewis, and Lt. Karl Elmquist, and there are others that have made 
important contributions to this project. See pages 27 and 28 in the 
document. 

In a 1955 letter to the NSA that only recently came forward, John 
Nash proposed an “unbreakable” encryption scheme. He wrote “I 
hope my handwriting, etc. do not give the impression I am just a crank or 
circle-squarer.... The significance of this conjecture [that certain encryption 
schemes are exponentially secure against key recovery attacks] .. is that it is 
quite feasible to design ciphers that are effectively unbreakable.”. John Nash 
made seminal contributions in mathematics and game theory, and was 
awarded both the Abel Prize in mathematics and the Nobel Memorial 
Prize in Economic Sciences. However, he struggled with mental 
illness throughout his life. His biography, A Beautiful Mind, was made 
into a popular movie. It is natural to compare Nash’s 1955 letter to the 
NSA to Gödel’s letter to von Neumann we mentioned before. From 
the theoretical computer science point of view, the crucial difference 
is that while Nash informally talks about exponential vs polynomial 
computation time, he does not mention the word “Turing machine” or 
other models of computation, and it is not clear if he is aware or not 
that his conjecture can be made mathematically precise (assuming a 
formalization of “sufficiently complex types of enciphering”). 

The definition of computational secrecy we use is the notion of 
computational indistinguishability (known to be equivalent to semantic 
security) that was given by Goldwasser and Micali in 1982. 

Although they used a different terminology, Diffie and Hellman 
already made clear in their paper that their protocol can be used as 
a public key encryption, with the first message being put in a “pub- 
lic file”. In 1985, ElGamal showed how to obtain a signature scheme 
based on the Diffie Hellman ideas, and since he described the Diffie- 
Hellman encryption scheme in the same paper, the public key encryp- 
tion scheme originally proposed by Diffie and Hellman is sometimes 
also known as ElGamal encryption. 

My survey contains a discussion on the different types of public key 
assumptions. While the standard elliptic curve cryptographic schemes 
are as susceptible to quantum computers as Diffie-Hellman and RSA, 
their main advantage is that the best known classical algorithms for 
computing discrete logarithms over elliptic curve groups take time 2^{εn} 
for some ε > 0 where n is the number of bits to describe a group ele- 
ment. In contrast, for the multiplicative group modulo a prime p the 
best algorithms take time 2^{O(n^{1/3} polylog(n))} which means that (assum- 
ing the known algorithms are optimal) we need to set the prime to be 
bigger (and so have larger key sizes with corresponding overhead in 
communication and computation) to get the same level of security. 

Zero-knowledge proofs were constructed by Goldwasser, Micali, 
and Rackoff in 1982, and their wide applicability was shown (using 
the theory of NP completeness) by Goldreich, Micali, and Wigderson 
in 1986. 

Two party and multiparty secure computation protocols were con- 
structed (respectively) by Yao in 1982 and Goldreich, Micali, and 
Wigderson in 1987. The latter work gave a general transformation 
from security against passive adversaries to security against active 
adversaries using zero knowledge proofs. 


22 
Proofs and algorithms 


“Let's not try to define knowledge, but try to define zero-knowledge.”, Shafi 
Goldwasser. 


Proofs have captured human imagination for thousands of years, 
ever since the publication of Euclid’s Elements, a book second only to 
the bible in the number of editions printed. 

Plan: 


• Proofs and algorithms 
• Interactive proofs 
• Zero knowledge proofs 
• Propositions as types, Coq and other proof assistants. 


22.1 EXERCISES 
22.2 BIBLIOGRAPHICAL NOTES 
23 
Quantum computing 


Learning Objectives: 

• See main aspects in which quantum mechanics differs from local 
deterministic theories. 
• Model of quantum circuits, or equivalently QNAND-CIRC programs 
• The complexity class BQP and what we know about its relation to 
other classes 
• Ideas behind Shor’s Algorithm and the Quantum Fourier Transform 


“We always have had (secret, secret, close the doors!) ... a great deal of diffi- 
culty in understanding the world view that quantum mechanics represents ... 
It has not yet become obvious to me that there's no real problem. ... Can I learn 
anything from asking this question about computers—about this may or may 
not be mystery as to what the world view of quantum mechanics is?” , Richard 
Feynman, 1981 


“The only difference between a probabilistic classical world and the equations 
of the quantum world is that somehow or other it appears as if the probabilities 
would have to go negative”, Richard Feynman, 1981 


There were two schools of natural philosophy in ancient Greece. 
Aristotle believed that objects have an essence that explains their behav- 
ior, and a theory of the natural world has to refer to the reasons (or “fi- 
nal cause” to use Aristotle’s language) as to why they exhibit certain 
phenomena. Democritus believed in a purely mechanistic explanation 
of the world. In his view, the universe was ultimately composed of 
elementary particles (or Atoms) and our observed phenomena arise 
from the interactions between these particles according to some local 
rules. Modern science (arguably starting with Newton) has embraced 
Democritus’ point of view, of a mechanistic or “clockwork” universe 
of particles and forces acting upon them. 

While the classification of particles and forces evolved with time, 
to a large extent the “big picture” has not changed from Newton till 
Einstein. In particular it was held as an axiom that if we knew fully 
the current state of the universe (i.e., the particles and their properties 
such as location and velocity) then we could predict its future state at 
any point in time. In computational language, in all these theories the 
state of a system with n particles could be stored in an array of O(n) 
numbers, and predicting the evolution of the system can be done by 
running some efficient (e.g., poly(n) time) deterministic computation 
on this array. 
23.1 THE DOUBLE SLIT EXPERIMENT 


Alas, in the beginning of the 20th century, several experimental re- 
sults were calling into question this “clockwork” or “billiard ball” 
theory of the world. One such experiment is the famous double slit ex- 
periment. Here is one way to describe it. Suppose that we buy one of 
those baseball pitching machines, and aim it at a soft plastic wall, but 
put a metal barrier with a single slit between the machine and the plastic 
wall (see Fig. 23.1). If we shoot baseballs at the plastic wall, then some 
of the baseballs would bounce off the metal barrier, while some would 
make it through the slit and dent the wall. If we now carve out an ad- 
ditional slit in the metal barrier then more balls would get through, 
and so the plastic wall would be even more dented. 

So far this is pure common sense, and it is indeed (to my knowl- 
edge) an accurate description of what happens when we shoot base- 
balls at a plastic wall. However, this is not the same when we shoot 
photons. Amazingly, if we shoot with a “photon gun” (i.e., a laser) at 
a wall equipped with photon detectors through some barrier, then 
(as shown in Fig. 23.2) in some positions of the wall we will see fewer 
hits when the two slits are open than when only one of them is.¹ In 
particular there are positions in the wall that are hit when the first slit 
is open, hit when the second slit is open, but are not hit at all when both 
slits are open! 

It seems as if each photon coming out of the gun is aware of the 
global setup of the experiment, and behaves differently if two slits are 
open than if only one is. If we try to “catch the photon in the act” and 
place a detector right next to each slit so we can see exactly the path 
each photon takes then something even more bizarre happens. The 
mere fact that we measure the path changes the photon’s behavior, and 
now this “destructive interference” pattern is gone and the number 
of times a position is hit when two slits are open is the sum of the 
number of times it is hit when each slit is open. 


23.2 QUANTUM AMPLITUDES 


The double slit and other experiments ultimately forced scientists to 
accept a very counterintuitive picture of the world. It is not merely 
about nature being randomized, but rather it is about the probabilities 
in some sense “going negative” and cancelling each other! 


Figure 23.1 (panels: Single Slit, Double Slit): In the “double baseball 
experiment” we shoot baseballs from a gun at a soft wall through a 
hard barrier that has one or two slits open in it. There 
is only “constructive interference” in the sense that 
the dent in each position in the wall when both slits 
are open is the sum of the dents when each slit is 
open on its own. 


1 A nice illustrated description of the double slit 
experiment appears in this video. 
Figure 23.2: The setup of the double slit experiment 
in the case of photon or electron guns. We see also 
destructive interference in the sense that there are 
some positions on the wall that get fewer hits when 
both slits are open than they get when only one of the 
slits is open. Image credit: Wikipedia. 
To see what we mean by this, let us go back to the baseball exper- 
iment. Suppose that the probability a ball passes through the left slit 
is p_L and the probability that it passes through the right slit is p_R. 
Then, if we shoot N balls out of each gun, we expect the wall will be 
hit (p_L + p_R)N times. In contrast, in the quantum world of photons 
instead of baseballs, it can sometimes be the case that in both the first 
and second case the wall is hit with positive probabilities p_L and p_R 
respectively but somehow when both slits are open the wall (or a par- 
ticular position in it) is not hit at all. It’s almost as if the probabilities 
can “cancel each other out”. 

To understand the way we model this in quantum mechanics, it is 
helpful to think of a “lazy evaluation” approach to probability. We 
can think of a probabilistic experiment such as shooting a baseball 
through two slits in two different ways: 


• When a ball is shot, “nature” tosses a coin and decides if it will go 
through the left slit (which happens with probability p_L), right slit 
(which happens with probability p_R), or bounce back. If it passes 
through one of the slits then it will hit the wall. Later we can look at 
the wall and find out whether or not this event happened, but the 
fact that the event happened or not is determined independently of 
whether or not we look at the wall. 


• The other viewpoint is that when a ball is shot, “nature” computes 
the probabilities p_L and p_R as before, but does not yet “toss the 
coin” and determine what happened. Only when we actually 
look at the wall, nature tosses a coin and with probability p_L + p_R 
ensures we see a dent. That is, nature uses “lazy evaluation”, and 
only determines the result of a probabilistic experiment when we 
decide to measure it. 


While the first scenario seems much more natural, the end result 
in both is the same (the wall is hit with probability p_L + p_R) and so 
the question of whether we should model nature as following the first 
scenario or second one seems like asking about the proverbial tree that 
falls in the forest with no one hearing it. 

However, when we want to describe the double slit experiment 
with photons rather than baseballs, it is the second scenario that lends 
itself better to a quantum generalization. Quantum mechanics as- 
sociates a number α known as an amplitude with each probabilistic 
experiment. This number α can be negative, and in fact even complex. 
We never observe the amplitudes directly, since whenever we mea- 
sure an event with amplitude α, nature tosses a coin and determines 
that the event happens with probability |α|^2. However, the sign (or 
in the complex case, phase) of the amplitudes can affect whether two 
different events have constructive or destructive interference. 
Specifically, consider an event that can either occur or not (e.g. “de- 
tector number 17 was hit by a photon”). In classical probability, we 
model this by a probability distribution over the two outcomes: a pair 
of non-negative numbers p and q such that p + q = 1, where p corre- 
sponds to the probability that the event occurs and q corresponds to 
the probability that the event does not occur. In quantum mechanics, 
we model this also by a pair of numbers, which we call amplitudes. This 
is a pair of (potentially negative or even complex) numbers α and β 
such that |α|^2 + |β|^2 = 1. The probability that the event occurs is |α|^2 
and the probability that it does not occur is |β|^2. In isolation, these 
negative or complex numbers don’t matter much, since we square 
them anyway to obtain probabilities. But the interaction of positive 
and negative amplitudes can result in surprising cancellations where 
somehow combining two scenarios where an event happens with 
positive probability results in a scenario where it never does. 


Quantum mechanics is a mathematical theory that allows us to 
calculate and predict the results of the double-slit and many other ex- 
periments. If you think of quantum mechanics as an explanation as to 
what “really” goes on in the world, it can be rather confusing. How- 
ever, if you simply “shut up and calculate” then it works amazingly 
well at predicting experimental results. In particular, in the double 
slit experiment, for any position in the wall, we can compute num- 
bers α and β such that photons from the first and second slit hit that 
position with probabilities |α|^2 and |β|^2 respectively. When we open 
both slits, the probability that the position will be hit is proportional 
to |α + β|^2, and so in particular, if α = −β then it will be the case that, 
despite being hit when either slit one or slit two is open, the position 
is not hit at all when they both are. If you are confused by quantum 
mechanics, you are not alone: for decades people have been trying to 
come up with explanations for “the underlying reality” behind quan- 
tum mechanics, including Bohmian Mechanics, Many Worlds and 
others. However, none of these interpretations have gained universal 
acceptance and all of those (by design) yield the same experimental 
predictions. Thus at this point many scientists prefer to just ignore the 
question of what is the “true reality” and go back to simply “shutting 
up and calculating”. 
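
In the spirit of “shutting up and calculating”, the cancellation can be checked numerically. The Python sketch below uses illustrative amplitudes of our own choosing (α = 1/√2 and β = −α) and hypothetical function names; it computes the unnormalized weight |α + β|^2 rather than a normalized probability.

```python
import math

def single_slit_probability(amplitude: complex) -> float:
    """Probability of a hit when only one slit is open: |amplitude|^2."""
    return abs(amplitude) ** 2

def both_slits_weight(alpha: complex, beta: complex) -> float:
    """Quantity proportional to the hit probability when both slits are open: |alpha + beta|^2."""
    return abs(alpha + beta) ** 2

alpha = 1 / math.sqrt(2)
beta = -alpha

print(single_slit_probability(alpha), single_slit_probability(beta))
print(both_slits_weight(alpha, beta))   # amplitudes of opposite sign cancel
print(both_slits_weight(alpha, alpha))  # amplitudes of equal sign reinforce
```

With probabilities alone the two single-slit contributions could only add; it is the sign carried by the amplitudes that lets them cancel.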


23.3 BELL'S INEQUALITY 


There is something weird about quantum mechanics. In 1935 Einstein, 
Podolsky and Rosen (EPR) tried to pinpoint this issue by highlighting 
a previously unrealized corollary of this theory. They showed that 
the idea that nature does not determine the results of an experiment 
until it is measured results in so called “spooky action at a distance”. 
Namely, making a measurement of one object may instantaneously 
affect the state (i.e., the vector of amplitudes) of another object in the 
other end of the universe. 

Since the vector of amplitudes is just a mathematical abstraction, 
the EPR paper was considered to be merely a thought experiment for 
philosophers to be concerned about, without bearing on experiments. 
This changed when in 1965 John Bell showed an actual experiment 
to test the predictions of EPR and hence pit intuitive common sense 
against quantum mechanics. Quantum mechanics won: it turns 
out that it is in fact possible to use measurements to create correlations 
between the states of objects far removed from one another that cannot 
be explained by any prior theory. Nonetheless, since the results of 
these experiments are so obviously wrong to anyone that has ever sat 
in an armchair, there are still a number of Bell denialists arguing 
that this can’t be true and quantum mechanics is wrong. 

So, what is this Bell’s Inequality? Suppose that Alice and Bob try 
to convince you they have telepathic ability, and they aim to prove it 
via the following experiment. Alice and Bob will be in separate closed 
rooms.² You will interrogate Alice and your associate will interrogate 
Bob. You choose a random bit x ∈ {0,1} and your associate chooses 
a random y ∈ {0,1}. We let a be Alice’s response and b be Bob’s 
response. We say that Alice and Bob win this experiment if a ⊕ b = 
x ∧ y. In other words, Alice and Bob need to output two bits that 
disagree if x = y = 1 and agree otherwise.³ 

2 If you are extremely paranoid about Alice and Bob 
communicating with one another, you can coordinate 
with your assistant to perform the experiment exactly 
at the same time, and make sure that the rooms 
are sufficiently far apart (e.g., are on two different 
continents, or maybe even one is on the moon and 
another is on earth) so that Alice and Bob couldn't 
communicate to each other in time even if they do so 
at the speed of light. 

Now if Alice and Bob are not telepathic, then they need to agree in 
advance on some strategy. It’s not hard for Alice and Bob to succeed 
with probability 3/4: just always output the same bit. Moreover, by 
doing some case analysis, we can show that no matter what strategy 
they use, Alice and Bob cannot succeed with higher probability than 
that:⁴


Theorem 23.2 — Bell’s Inequality. For every two functions
$f,g : \{0,1\} \to \{0,1\}$, $\Pr_{x,y \in \{0,1\}}[f(x) \oplus g(y) = x \wedge y] \le 3/4$.


Proof. Since the probability is taken over all four choices of
$x,y \in \{0,1\}$, the only way the theorem can be violated is if there
exist two functions $f, g$ that satisfy

$$f(x) \oplus g(y) = x \wedge y$$

for all the four choices of $(x,y) \in \{0,1\}^2$. Let’s plug in all these four
choices and see what we get (below we use the equalities $z \oplus 0 = z$,
$z \wedge 0 = 0$ and $z \wedge 1 = z$):

$$\begin{aligned}
f(0) \oplus g(0) &= 0 && \text{(plugging in } x = 0, y = 0\text{)} \\
f(0) \oplus g(1) &= 0 && \text{(plugging in } x = 0, y = 1\text{)} \\
f(1) \oplus g(0) &= 0 && \text{(plugging in } x = 1, y = 0\text{)} \\
f(1) \oplus g(1) &= 1 && \text{(plugging in } x = 1, y = 1\text{)}
\end{aligned}$$

If we XOR together the first and second equalities we get
$g(0) \oplus g(1) = 0$, while if we XOR together the third and fourth
equalities we get $g(0) \oplus g(1) = 1$, thus obtaining a contradiction.

∎
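
The case analysis in the proof can also be checked mechanically. The following is a minimal sketch (the function name `win_probability` and the truth-table encoding of strategies are choices made for this illustration, not notation from the text) that enumerates all 16 pairs of deterministic strategies and reports the best winning probability.

```python
from itertools import product

def win_probability(f, g):
    """Fraction of the four (x, y) pairs on which f(x) XOR g(y) equals x AND y.

    f and g are truth tables: f[x] is Alice's answer on x, g[y] is Bob's on y.
    """
    wins = sum((f[x] ^ g[y]) == (x & y) for x, y in product((0, 1), repeat=2))
    return wins / 4

# Every function {0,1} -> {0,1} is one of the four truth tables below.
strategies = list(product((0, 1), repeat=2))

best = max(win_probability(f, g) for f in strategies for g in strategies)
always_zero = win_probability((0, 0), (0, 0))  # both always output 0
print(best, always_zero)
```

No pair of tables wins on all four inputs, and the "always output the same bit" strategy already attains the maximum, matching the theorem.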


An amazing experimentally verified fact is that quantum mechanics
allows for “telepathy”.⁵ Specifically, it has been shown that using the
weirdness of quantum mechanics, there is in fact a strategy for Alice
and Bob to succeed in this game with probability larger than 3/4 (in
fact, they can succeed with probability about 0.85, see Lemma 23.5).

3 This form of Bell’s game was shown by Clauser,
Horne, Shimony, and Holt.

4 Theorem 23.2 assumes that Alice and Bob
use deterministic strategies f and g respectively. More
generally, Alice and Bob could use a randomized
strategy, or equivalently, each could choose f and
g from some distributions F and G respectively.
However the averaging principle (Lemma 18.10)
implies that if all possible deterministic strategies
succeed with probability at most 3/4, then the same is
true for all randomized strategies.

5 More accurately, one either has to give up on a
“billiard ball type” theory of the universe or believe
in telepathy (believe it or not, some scientists went for
the latter option).


23.4 QUANTUM WEIRDNESS


Some of the counterintuitive properties that arise from quantum
mechanics include:


• Interference - As we've seen, quantum amplitudes can “cancel each
other out”.


• Measurement - The idea that amplitudes can be negative as long as
“no one is looking” and “collapse” (by squaring them) to positive
probabilities when they are measured is deeply disturbing. Indeed,
as shown by EPR and Bell, this leads to various strange outcomes
such as “spooky actions at a distance”, where we can create
correlations between the results of measurements in places far removed.
Unfortunately (or fortunately?) these strange outcomes have been
confirmed experimentally.


• Entanglement - The notion that two parts of the system could be
connected in this weird way where measuring one will affect the
other is known as quantum entanglement.


As counter-intuitive as these concepts are, they have been experi- 
mentally confirmed, so we just have to live with them. 


23.5 QUANTUM COMPUTING AND COMPUTATION - AN EXECUTIVE 
SUMMARY. 


One of the strange aspects of the quantum-mechanical picture of the 
world is that unlike in the billiard ball example, there is no obvious 
algorithm to simulate the evolution of n particles over t time periods 
in poly(n, t) steps. In fact, the natural way to simulate n quantum par- 
ticles will require a number of steps that is exponential in n. This is a 
huge headache for scientists that actually need to do these calculations 
in practice. 

In 1981, physicist Richard Feynman proposed to “turn this
lemon to lemonade” by making the following almost tautological
observation:

If a physical system cannot be simulated by a computer in T steps, the
system can be considered as performing a computation that would take
more than T steps.


So, he asked whether one could design a quantum system such that
its outcome y based on the initial condition x would be some function
y = f(x) such that (a) we don’t know how to efficiently compute it
in any other way, and (b) it is actually useful for something.⁶ In 1985,
David Deutsch formally suggested the notion of a quantum Turing
machine, and the model has since been refined in works of Deutsch
and Jozsa and Bernstein and Vazirani. Such a system is now known as
a quantum computer.

For a while these hypothetical quantum computers seemed useful 
for one of two things. First, to provide a general-purpose mecha- 
nism to simulate a variety of the real quantum systems that people 
care about, such as various interactions inside molecules in quantum 
chemistry. Second, as a challenge to the Extended Church Turing hypoth- 
esis which says that every physically realizable computation device 
can be modeled (up to polynomial overhead) by Turing machines (or 
equivalently, NAND-TM / NAND-RAM programs). 

Quantum chemistry is important (and in particular understand- 
ing it can be a bottleneck for designing new materials, drugs, and 
more), but it is still a rather niche area within the broader context of 
computing (and even scientific computing) applications. Hence for a
while most researchers (to the extent they were aware of it) thought
of quantum computers as a theoretical curiosity that has little bearing
on practice, given that this theoretical “extra power” of quantum
computers seemed to offer little advantage in the majority of the
problems people want to solve in areas such as combinatorial optimization,
machine learning, data structures, etc.

To some extent this is still true today. As far as we know, quantum
computers, if built, will not provide exponential speedups for 95%
of the applications of computing.⁷ In particular, as far as we know,
quantum computers will not help us solve NP complete problems in
polynomial or even sub-exponential time, though Grover’s algorithm
(Remark 23.4) does yield a quadratic advantage in many cases.

However, there is one cryptography-sized exception: In 1994 Peter 
Shor showed that quantum computers can solve the integer factoring 
and discrete logarithm problems in polynomial time. This result has 
captured the imagination of a great many people, and completely 
energized research into quantum computing. This is both because the 
hardness of these particular problems provides the foundations for 
securing such a huge part of our communications (and these days, 
our economy), and because it was a powerful demonstration that
quantum computers could turn out to be useful for problems that
a priori seemed to have nothing to do with quantum physics.

6 As its title suggests, Feynman’s lecture was actually
focused on the other side of simulating physics with
a computer. However, he mentioned that as a “side
remark” one could wonder if it’s possible to simulate
physics with a new kind of computer - a “quantum
computer” which would “not [be] a Turing machine,
but a machine of a different kind”. As far as I know,
Feynman did not suggest that such a computer could
be useful for computations completely outside the
domain of quantum simulation. Indeed, he was
more interested in the question of whether quantum
mechanics could be simulated by a classical computer.

7 This “95 percent” is a figure of speech, but not
completely so. At the time of this writing, the electricity
consumed by cryptocurrency mining is estimated to be
at least 70 TWh, or 0.3 percent of the world’s
production, which is about 2 to 5 percent of the total
energy usage for the computing industry. All the
current cryptocurrencies will be broken by quantum
computers. Also, for many web servers the TLS
protocol (which is based on the current non-lattice based
systems and would be completely broken by quantum
computing) is responsible for about 1 percent of the
CPU usage.

As we'll discuss later, at the moment there are several intensive 
efforts to construct large scale quantum computers. It seems safe 
to say that, as far as we know, in the next five years or so there will 
not be a quantum computer large enough to factor, say, a 1024 bit 
number. On the other hand, it does seem quite likely that in the very 
near future there will be quantum computers which achieve some task 
exponentially faster than the best-known way to achieve the same 
task with a classical computer. When and if a quantum computer is 
built that is strong enough to break reasonable parameters of
Diffie-Hellman, RSA and elliptic curve cryptography is anybody’s guess. It
could also be a “self destroying prophecy” whereby the existence of
a small-scale quantum computer would cause everyone to shift away
to lattice-based crypto which in turn will diminish the motivation
to invest the huge resources needed to build a large scale quantum
computer.⁸

8 Of course, given that we're still hearing of attacks
exploiting “export grade” cryptography that was
supposed to disappear in the 1990’s, I imagine that
we'll still have products running 1024 bit RSA when
everyone has a quantum laptop.


23.6 QUANTUM SYSTEMS


Before we talk about quantum computing, let us recall how we
physically realize “vanilla” or classical computing. We model a logical bit
that can equal 0 or 1 by some physical system that can be in one of
two states. For example, it might be a wire with high or low voltage,
charged or uncharged capacitor, or even (as we saw) a pipe with or
without a flow of water, or the presence or absence of a soldier crab. A
classical system of n bits is composed of n such “basic systems”, each
of which can be in either a “zero” or “one” state. We can model the
state of such a system by a string $s \in \{0,1\}^n$. If we perform an
operation such as writing to the 17-th bit the NAND of the 3rd and 5th
bits, this corresponds to applying a local function to s such as setting
$s_{17} = 1 - s_3 \cdot s_5$.

In the probabilistic setting, we would model the state of the system
by a distribution. For an individual bit, we could model it by a pair of
non-negative numbers $\alpha, \beta$ such that $\alpha + \beta = 1$, where $\alpha$ is the
probability that the bit is zero and $\beta$ is the probability that the bit is one.
For example, applying the negation (i.e., NOT) operation to this bit
corresponds to mapping the pair $(\alpha, \beta)$ to $(\beta, \alpha)$ since the probability
that $NOT(\sigma)$ is equal to 1 is the same as the probability that $\sigma$ is equal
to 0. This means that we can think of the NOT function as the linear
map $N : \mathbb{R}^2 \to \mathbb{R}^2$ such that
$N \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \begin{pmatrix} \beta \\ \alpha \end{pmatrix}$,
or equivalently as the matrix
$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$.

If we think of the n-bit system as a whole, then since the n bits can
take one of $2^n$ possible values, we model the state of the system as a
vector $p$ of $2^n$ probabilities. For every $s \in \{0,1\}^n$, we denote by $e_s$
the $2^n$ dimensional vector that has 1 in the coordinate corresponding
to $s$ (identifying it with a number in $[2^n]$), and so can write $p$ as
$\sum_{s \in \{0,1\}^n} p_s e_s$ where $p_s$ is the probability that the system is in
the state $s$.

Applying the operation above of setting the 17-th bit to the NAND
of the 3rd and 5th bits, corresponds to transforming the vector $p$ to
the vector $Fp$ where $F : \mathbb{R}^{2^n} \to \mathbb{R}^{2^n}$ is the linear map that maps $e_s$ to
$e_{s_0 \cdots s_{16}(1 - s_3 \cdot s_5)s_{18} \cdots s_{n-1}}$.⁹

9 Since $\{e_s\}_{s \in \{0,1\}^n}$ is a basis for $\mathbb{R}^{2^n}$, it suffices to
define the map F on vectors of this form.
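
To make the map F concrete, here is a small sketch on a 3-bit system rather than the 18-or-more bits of the running example. It assumes the operation "set bit 2 to the NAND of bits 0 and 1"; the helper names `nand_write_matrix` and `apply` are hypothetical, and strings are identified with numbers by reading them in binary with $s_0$ as the most significant bit.

```python
from itertools import product

def nand_write_matrix(n, i, j, k):
    """0/1 matrix F of size 2^n x 2^n with F e_s = e_t, where t equals s
    except that bit k is overwritten with NAND(s_i, s_j) = 1 - s_i * s_j."""
    dim = 2 ** n
    F = [[0] * dim for _ in range(dim)]
    for s in product((0, 1), repeat=n):
        t = list(s)
        t[k] = 1 - s[i] * s[j]
        col = int("".join(map(str, s)), 2)   # index of e_s
        row = int("".join(map(str, t)), 2)   # index of e_t
        F[row][col] = 1
    return F

def apply(F, p):
    """Matrix-vector product F p for a probability vector p."""
    return [sum(F[r][c] * p[c] for c in range(len(p))) for r in range(len(F))]

n = 3
uniform = [1 / 2 ** n] * 2 ** n              # all 8 strings equally likely
q = apply(nand_write_matrix(n, 0, 1, 2), uniform)
print(q)
```

Starting from the uniform distribution, all the probability ends up on the four strings whose last bit is the NAND of the first two (001, 011, 101, 110). Note that F is a stochastic matrix but not a permutation, since two strings can map to the same output.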


23.6.1 Quantum amplitudes 

In the quantum setting, the state of an individual bit (or “qubit”,
to use quantum parlance) is modeled by a pair of numbers $(\alpha, \beta)$
such that $|\alpha|^2 + |\beta|^2 = 1$. While in general these numbers can be
complex, for the rest of this chapter, we will often assume they are
real (though potentially negative), and hence often drop the absolute
value operator. (This turns out not to make much of a difference in
explanatory power.) As before, we think of $\alpha^2$ as the probability that
the bit equals 0 and $\beta^2$ as the probability that the bit equals 1. As we
did before, we can model the NOT operation by the map $N : \mathbb{R}^2 \to \mathbb{R}^2$
where $N(\alpha, \beta) = (\beta, \alpha)$.

Following quantum tradition, instead of using $e_0$ and $e_1$ as we did
above, from now on we will denote the vector $(1, 0)$ by $|0\rangle$ and the
vector $(0, 1)$ by $|1\rangle$ (and moreover, think of these as column vectors).
This is known as the Dirac “ket” notation. This means that NOT is
the unique linear map $N : \mathbb{R}^2 \to \mathbb{R}^2$ that satisfies $N|0\rangle = |1\rangle$ and
$N|1\rangle = |0\rangle$. In other words, in the quantum case, as in the probabilistic
case, NOT corresponds to the matrix

$$N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$


In classical computation, we typically think that there are only two
operations that we can do on a single bit: keep it the same or negate
it. In the quantum setting, a single bit operation corresponds to any
linear map $OP : \mathbb{R}^2 \to \mathbb{R}^2$ that is norm preserving in the sense that
for every $\alpha, \beta$, if we apply $OP$ to the vector
$\begin{pmatrix} \alpha \\ \beta \end{pmatrix}$ then we obtain a vector
$\begin{pmatrix} \alpha' \\ \beta' \end{pmatrix}$ such that $\alpha'^2 + \beta'^2 = \alpha^2 + \beta^2$.
Such a linear map $OP$ corresponds to a unitary two by two matrix.¹⁰
Keeping the bit the same corresponds to the matrix
$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and (as we’ve seen) the
NOT operation corresponds to the matrix
$N = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. But there
are other operations we can use as well. One such useful operation is
the Hadamard operation, which corresponds to the matrix

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}.$$

In fact it turns out that Hadamard is all that we need to add to a
classical universal basis to achieve the full power of quantum computing.

10 As we mentioned, quantum mechanics actually
models states as vectors with complex coordinates.
However, this does not make any qualitative difference
to our discussion.
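
The norm-preserving property is easy to check numerically. The sketch below is only an illustration under the real-amplitude assumption used in this chapter; the helper names `apply` and `norm_sq` and the sample state $(0.6, -0.8)$ are choices made here, not part of the text.

```python
import math

I = [[1, 0], [0, 1]]
N = [[0, 1], [1, 0]]
h = 1 / math.sqrt(2)
H = [[h, h], [h, -h]]

def apply(M, v):
    """Apply a 2x2 matrix M to the column vector v = (alpha, beta)."""
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

def norm_sq(v):
    return v[0] ** 2 + v[1] ** 2

state = [0.6, -0.8]          # alpha^2 + beta^2 = 0.36 + 0.64 = 1
norms = [norm_sq(apply(gate, state)) for gate in (I, N, H)]

plus = apply(H, [1, 0])      # H|0>: both outcomes have probability 1/2
back = apply(H, plus)        # applying H again returns to |0>
print(norms, plus, back)
```

The last line illustrates interference: the two contributions to the $|1\rangle$ amplitude of `back` have opposite signs and cancel.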


23.6.2 Recap 

The state of a quantum system of n qubits is modeled by a $2^n$
dimensional vector $\psi$ of unit norm (i.e., squares of all coordinates sum up
to 1), which we write as $\psi = \sum_{x \in \{0,1\}^n} \psi_x |x\rangle$ where $|x\rangle$ is the
column vector that has 0 in all coordinates except the one corresponding
to $x$ (identifying $\{0,1\}^n$ with the numbers $\{0, \ldots, 2^n - 1\}$). We use
the convention that if $a, b$ are strings of lengths $k$ and $\ell$ respectively
then we can write the $2^{k+\ell}$ dimensional vector with 1 in the $ab$-th
coordinate and zero elsewhere not just as $|ab\rangle$ but also as $|a\rangle|b\rangle$. In
particular, for every $x \in \{0,1\}^n$, we can write the vector $|x\rangle$ also as
$|x_0\rangle|x_1\rangle \cdots |x_{n-1}\rangle$. This notation satisfies certain nice distributive laws
such as $|a\rangle(|b\rangle + |b'\rangle)|c\rangle = |abc\rangle + |ab'c\rangle$.

A quantum operation on such a system is modeled by a $2^n \times 2^n$
unitary matrix $U$ (one that satisfies $UU^\top = I$ where $U^\top$ is the transpose
of $U$, or the conjugate transpose for complex matrices). If the system
is in state $\psi$ and we apply to it the operation $U$, then the new state of
the system is $U\psi$.

When we measure an n-qubit system in a state $\psi = \sum_{x \in \{0,1\}^n} \psi_x |x\rangle$,
then we observe the value $x \in \{0,1\}^n$ with probability $|\psi_x|^2$. In this
case, the system collapses to the state $|x\rangle$.
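
The conventions in this recap can be sketched in a few lines of code. This is only an illustration with real amplitudes: the helpers `ket`, `tensor` and `measurement_distribution` are hypothetical names, and basis strings are identified with indices by reading them in binary.

```python
import math

def ket(bits):
    """Basis vector |bits> as a list of 2^len(bits) amplitudes."""
    v = [0.0] * (2 ** len(bits))
    v[int(bits, 2)] = 1.0
    return v

def tensor(u, v):
    """Tensor (Kronecker) product, so that |a>|b> equals |ab>."""
    return [a * b for a in u for b in v]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def measurement_distribution(psi):
    """Map each outcome x with nonzero amplitude to its probability |psi_x|^2."""
    n = int(math.log2(len(psi)))
    return {format(x, f"0{n}b"): abs(a) ** 2 for x, a in enumerate(psi) if a != 0}

# |a>|b> = |ab>, and the distributive law |a>(|b> + |b'>)|c> = |abc> + |ab'c>.
same = tensor(ket("1"), ket("01")) == ket("101")
lhs = tensor(tensor(ket("1"), add(ket("0"), ket("1"))), ket("0"))
rhs = add(ket("100"), ket("110"))

# The state (|00> + |11>)/sqrt(2): outcomes 00 and 11 each have probability 1/2.
s = 1 / math.sqrt(2)
psi = [s * a for a in add(ket("00"), ket("11"))]
dist = measurement_distribution(psi)
print(same, lhs == rhs, dist)
```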


23.7 ANALYSIS OF BELL'S INEQUALITY (OPTIONAL) 


Now that we have the notation in place, we can show a strategy for
Alice and Bob to display “quantum telepathy” in Bell’s Game. Recall
that in the classical case, Alice and Bob can succeed in the “Bell
Game” with probability at most 3/4 = 0.75. We now show that quantum
mechanics allows them to succeed with probability at least 0.8.¹¹

11 The strategy we show is not the best one. Alice and
Bob can in fact succeed with probability $\cos^2(\pi/8) \sim 0.854$.

Lemma 23.5 There is a 2-qubit quantum state $\psi \in \mathbb{C}^4$ so that if Alice
has access to the first qubit of $\psi$, can manipulate and measure it and
output $a \in \{0,1\}$ and Bob has access to the second qubit of $\psi$ and can
manipulate and measure it and output $b \in \{0,1\}$ then
$\Pr[a \oplus b = x \wedge y] \ge 0.8$.


Proof. Alice and Bob will start by preparing a 2-qubit quantum system
in the state

$$\psi = \tfrac{1}{\sqrt{2}}|00\rangle + \tfrac{1}{\sqrt{2}}|11\rangle$$

(this state is known as an EPR pair). Alice takes the first qubit of
the system to her room, and Bob takes the second qubit to his room. Now,
when Alice receives $x$, if $x = 0$ she does nothing and if $x = 1$ she
applies the unitary map $R_{-\pi/8}$ to her qubit, where
$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$
is the unitary operation corresponding to rotation in the plane with
angle $\theta$. When Bob receives $y$, if $y = 0$ he does nothing and if $y = 1$
he applies the unitary map $R_{\pi/8}$ to his qubit. Then each one of them
measures their qubit and sends this as their response.

Recall that to win the game Bob and Alice want their outputs to
be more likely to differ if $x = y = 1$ and to be more likely to agree
otherwise. We will split the analysis in one case for each of the four
possible values of $x$ and $y$.

Case 1: $x = 0$ and $y = 0$. If $x = y = 0$ then the state does not
change. Because the state $\psi$ is proportional to $|00\rangle + |11\rangle$, the
measurements of Bob and Alice will always agree (if Alice measures 0 then the
state collapses to $|00\rangle$ and so Bob measures 0 as well, and similarly for
1). Hence in the case $x = y = 0$, Alice and Bob always win.

Case 2: $x = 0$ and $y = 1$. If $x = 0$ and $y = 1$ then after Alice
measures her bit, if she gets 0 then the system collapses to the state
$|00\rangle$, in which case after Bob performs his rotation, his qubit is in
the state $\cos(\pi/8)|0\rangle + \sin(\pi/8)|1\rangle$. Thus, when Bob measures his
qubit, he will get 0 (and hence agree with Alice) with probability
$\cos^2(\pi/8) \ge 0.85$. Similarly, if Alice gets 1 then the system collapses
to $|11\rangle$, in which case after rotation Bob’s qubit will be in the state
$-\sin(\pi/8)|0\rangle + \cos(\pi/8)|1\rangle$ and so once again he will agree with Alice
with probability $\cos^2(\pi/8)$.

The analysis for Case 3, where $x = 1$ and $y = 0$, is completely
analogous to Case 2. Hence Alice and Bob will agree with probability
$\cos^2(\pi/8)$ in this case as well.¹²

12 We are using the (not too hard) observation that
the result of this experiment is the same regardless of
the order in which Alice and Bob apply their rotations
and measurements.

Case 4: $x = 1$ and $y = 1$. For the case that $x = 1$ and $y = 1$,
after both Alice and Bob perform their rotations, the state will be
proportional to

$$R_{-\pi/8}|0\rangle R_{\pi/8}|0\rangle + R_{-\pi/8}|1\rangle R_{\pi/8}|1\rangle . \qquad (23.1)$$

Intuitively, since we rotate one qubit by $\pi/8$ and the other by $-\pi/8$,
the two are rotated by 45 degrees relative to each other, and
the measurements will behave like independent coin tosses that agree
with probability 1/2. However, for the sake of completeness, we now
show the full calculation.

Opening up the coefficients and using $\cos(-x) = \cos(x)$ and
$\sin(-x) = -\sin(x)$, we can see that (23.1) is proportional to

$$\begin{aligned}
&\cos^2(\pi/8)|00\rangle + \cos(\pi/8)\sin(\pi/8)|01\rangle
 - \sin(\pi/8)\cos(\pi/8)|10\rangle - \sin^2(\pi/8)|11\rangle \\
&- \sin^2(\pi/8)|00\rangle + \sin(\pi/8)\cos(\pi/8)|01\rangle
 - \cos(\pi/8)\sin(\pi/8)|10\rangle + \cos^2(\pi/8)|11\rangle .
\end{aligned}$$

Using the trigonometric identities $2\sin(a)\cos(a) = \sin(2a)$ and
$\cos^2(a) - \sin^2(a) = \cos(2a)$, we see that the amplitudes of
$|00\rangle$, $|01\rangle$, $|10\rangle$, $|11\rangle$ are proportional to $\cos(\pi/4)$,
$\sin(\pi/4)$, $-\sin(\pi/4)$ and $\cos(\pi/4)$ respectively, all of which have
absolute value $\tfrac{1}{\sqrt{2}}$.
Hence all four options for $(a, b)$ are equally likely, which means that in
this case $a = b$ with probability 0.5.

Taking all the four cases together, the overall probability of winning
the game is at least $\tfrac{1}{4} \cdot 1 + \tfrac{1}{2} \cdot 0.85 + \tfrac{1}{4} \cdot 0.5 = 0.8$.
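
The four cases can be double-checked numerically. The following sketch simulates the strategy directly on the 4-dimensional state vector; the names `rotation`, `apply_pair` and `win_probability` are hypothetical, and it applies both rotations before measuring both qubits (which, by the observation in the footnote about ordering, gives the same statistics).

```python
import math

def rotation(theta):
    """The matrix R_theta = [[cos, -sin], [sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

def apply_pair(A, B, psi):
    """Apply A to the first qubit and B to the second qubit of a 2-qubit
    state, where psi[2*a + b] is the amplitude of |ab>."""
    out = [0.0] * 4
    for a in (0, 1):
        for b in (0, 1):
            for a2 in (0, 1):
                for b2 in (0, 1):
                    out[2 * a2 + b2] += A[a2][a] * B[b2][b] * psi[2 * a + b]
    return out

def win_probability(x, y):
    """Probability that a XOR b == x AND y under the strategy of Lemma 23.5."""
    epr = [1 / math.sqrt(2), 0.0, 0.0, 1 / math.sqrt(2)]
    A = rotation(-math.pi / 8) if x == 1 else IDENTITY
    B = rotation(math.pi / 8) if y == 1 else IDENTITY
    psi = apply_pair(A, B, epr)
    return sum(psi[2 * a + b] ** 2
               for a in (0, 1) for b in (0, 1) if (a ^ b) == (x & y))

per_case = {(x, y): win_probability(x, y) for x in (0, 1) for y in (0, 1)}
overall = sum(per_case.values()) / 4
print(per_case, overall)
```

The overall value comes out slightly above 0.8 because the text rounds $\cos^2(\pi/8)$ down to 0.85.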

23.8 QUANTUM COMPUTATION


Recall that in the classical setting, we modeled computation as ob- 
tained by a sequence of basic operations. We had two types of computa- 
tional models: 


• Non uniform models of computation such as Boolean circuits and
NAND-CIRC programs, where a finite function $f : \{0,1\}^n \to \{0,1\}$
is computable in size T if it can be expressed as a combination of
T basic operations (gates in a circuit or lines in a NAND-CIRC
program).


• Uniform models of computation such as Turing machines and
NAND-TM programs, where an infinite function $F : \{0,1\}^* \to \{0,1\}$ is
computable in time T(n) if there is a single algorithm that on input
$x \in \{0,1\}^n$ evaluates $F(x)$ using at most T(n) basic steps.


When considering efficient computation, we defined the class P to
consist of all infinite functions $F : \{0,1\}^* \to \{0,1\}$ that can be
computed by a Turing machine or NAND-TM program in time p(n) for
some polynomial p(·). We defined the class $\mathbf{P}_{/poly}$ to consist of all
infinite functions $F : \{0,1\}^* \to \{0,1\}$ such that for every n, the
restriction $F_{\upharpoonright n}$ of $F$ to $\{0,1\}^n$ can be computed by a Boolean circuit or
NAND-CIRC program of size at most p(n) for some polynomial p(·).

We will do the same for quantum computation, focusing mostly on
the non uniform setting of quantum circuits, since that is simpler, and
already illustrates the important differences with classical computing.

23.8.1 Quantum circuits

A quantum circuit is analogous to a Boolean circuit, and can be
described as a directed acyclic graph. One crucial difference is that the
out degree of every vertex in a quantum circuit is at most one. This
is because we cannot “reuse” quantum states without measuring
them (which collapses their “probabilities”). Therefore, we cannot
use the same qubit as input for two different gates.¹³ Another
more technical difference is that to express our operations as unitary
matrices, we will need to make sure all our gates are reversible.
This is not hard to ensure. For example, in the quantum context,
instead of thinking of NAND as a (non reversible) map from $\{0,1\}^2$ to
$\{0,1\}$, we will think of it as the reversible map on three qubits that
maps $a, b, c$ to $a, b, c \oplus NAND(a, b)$ (i.e., flip the last bit if NAND
of the first two bits is 1). Equivalently, the NAND operation
corresponds to the $8 \times 8$ unitary matrix $U_{NAND}$ such that (identifying
$\{0,1\}^3$ with $[8]$) for every $a, b, c \in \{0,1\}$, if $|abc\rangle$ is the basis
element with 1 in the $abc$-th coordinate and zero elsewhere, then
$U_{NAND}|abc\rangle = |ab(c \oplus NAND(a, b))\rangle$.¹⁴ If we order the rows and
columns as 000, 001, 010, ..., 111, then $U_{NAND}$ can be written as the
following matrix:

$$U_{NAND} = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}$$

13 This is known as the No Cloning Theorem.

14 Readers familiar with quantum computing should
note that $U_{NAND}$ is a close variant of the so called
Toffoli gate and so QNAND-CIRC programs correspond
to quantum circuits with the Hadamard and Toffoli gates.

If we have an n qubit system, then for $i, j, k \in [n]$, we will denote
by $U^{i,j,k}_{NAND}$ the $2^n \times 2^n$ unitary matrix that corresponds to applying
$U_{NAND}$ to the i-th, j-th, and k-th bits, leaving the others intact. That is,
for every $v = \sum_{x \in \{0,1\}^n} v_x |x\rangle$,

$$U^{i,j,k}_{NAND} v = \sum_{x \in \{0,1\}^n} v_x |x_0 \cdots x_{k-1}(x_k \oplus NAND(x_i, x_j))x_{k+1} \cdots x_{n-1}\rangle .$$

As mentioned above, we will also use the Hadamard or HAD operation.
A quantum circuit is obtained by applying a sequence of $U_{NAND}$
and HAD gates, where a HAD gate corresponds to the matrix

$$H = \frac{1}{\sqrt{2}} \begin{pmatrix} +1 & +1 \\ +1 & -1 \end{pmatrix}.$$

Another way to define $H$ is that for $b \in \{0,1\}$,
$H|b\rangle = \tfrac{1}{\sqrt{2}}|0\rangle + \tfrac{1}{\sqrt{2}}(-1)^b|1\rangle$.
We define $HAD^i$ to be the $2^n \times 2^n$ unitary matrix that
applies HAD to the i-th qubit and leaves the others intact. Using the
ket notation, we can write this as

$$HAD^i \sum_{x \in \{0,1\}^n} v_x |x\rangle = \frac{1}{\sqrt{2}} \sum_{x \in \{0,1\}^n} v_x |x_0 \cdots x_{i-1}\rangle \left(|0\rangle + (-1)^{x_i}|1\rangle\right) |x_{i+1} \cdots x_{n-1}\rangle .$$
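
As a sanity check on the reversible NAND gate, the sketch below builds the $8 \times 8$ matrix directly from the rule $|abc\rangle \mapsto |ab(c \oplus NAND(a,b))\rangle$ and verifies that it is a permutation matrix equal to its own transpose and inverse. The helper names `u_nand` and `matmul` are hypothetical.

```python
def u_nand():
    """8x8 matrix of |abc> -> |ab(c XOR NAND(a, b))>, with rows and columns
    ordered 000, 001, ..., 111 (so |abc> has index 4a + 2b + c)."""
    U = [[0] * 8 for _ in range(8)]
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                src = 4 * a + 2 * b + c
                dst = 4 * a + 2 * b + (c ^ (1 - a * b))
                U[dst][src] = 1
    return U

def matmul(A, B):
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]

U = u_nand()
identity = [[int(r == c) for c in range(8)] for r in range(8)]
transpose = [list(col) for col in zip(*U)]

for row in U:
    print(row)
```

Since the gate only swaps pairs of basis states, applying it twice restores the original state, which is exactly the reversibility required of a quantum gate.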


A quantum circuit is obtained by composing these basic operations
on some m qubits. If $m \ge n$, we use a circuit to compute a function
$f : \{0,1\}^n \to \{0,1\}$ as follows:


• On input $x$, we initialize the system to hold $x_0, \ldots, x_{n-1}$ in the first n
qubits, and initialize all remaining m − n qubits to zero.


• We execute each elementary operation one by one.


• At the end of the computation, we measure the system, and output
the result of the last qubit (i.e. the qubit in location m − 1).¹⁵


• We say that the circuit computes $f$, if the probability that this output
equals $f(x)$ is at least 2/3. Note that this probability is obtained
by summing up the squares of the amplitudes of all coordinates
in the final state of the system corresponding to vectors $|y\rangle$ where
$y_{m-1} = f(x)$.


Formally this is defined as follows:


Definition 23.7 — Quantum circuit. A quantum circuit on m qubits with
s gates over the $\{U_{NAND}, HAD\}$ basis is a sequence of s unitary
$2^m \times 2^m$ matrices $U_0, \ldots, U_{s-1}$ such that each matrix $U_\ell$ is either of
the form $NAND^{i,j,k}$ for $i, j, k \in [m]$ or $HAD^i$ for $i \in [m]$.

A quantum circuit computes a function $f : \{0,1\}^n \to \{0,1\}$ if the
following is true for every $x \in \{0,1\}^n$:

Let $v$ be the vector

$$v = U_{s-1}U_{s-2} \cdots U_1 U_0 |x0^{m-n}\rangle$$

and write $v$ as $\sum_{y \in \{0,1\}^m} v_y |y\rangle$. Then

$$\sum_{y \in \{0,1\}^m \text{ s.t. } y_{m-1} = f(x)} |v_y|^2 \ge \frac{2}{3} .$$
15 For simplicity we restrict attention to functions
with a single bit of output, though the definition of
quantum circuits naturally extends to circuits with
multiple outputs.

Once we have the notion of quantum circuits, we can define the
quantum analog of $\mathbf{P}_{/poly}$ (i.e., define the class of functions
computable by polynomial size quantum circuits) as follows:


Definition 23.8 — $\mathbf{BQP}_{/poly}$. Let $F : \{0,1\}^* \to \{0,1\}$. We say that
$F \in \mathbf{BQP}_{/poly}$ if there exists some polynomial $p : \mathbb{N} \to \mathbb{N}$ such that
for every $n \in \mathbb{N}$, if $F_{\upharpoonright n}$ is the restriction of $F$ to inputs of length n,
then there is a quantum circuit of size at most p(n) that computes
$F_{\upharpoonright n}$.


Remark 23.9 — The obviously exponential fallacy. A 
priori it might seem “obvious” that quantum com- 
puting is exponentially powerful, since to perform a 
quantum computation on n bits we need to maintain
the $2^n$ dimensional state vector and apply $2^n \times 2^n$
matrices to it. Indeed popular descriptions of quantum
computing (too) often say something along the lines
that the difference between quantum and classical
computers is that a classical bit can either be zero or
one while a qubit can be in both states at once, and 
so in many qubits a quantum computer can perform 
exponentially many computations at once. 
Depending on how you interpret it, this description 
is either false or would apply equally well to proba- 
bilistic computation, even though we've already seen 
that every randomized algorithm can be simulated by 
a similar-sized circuit, and in fact we conjecture that 
BPP = P. 

Moreover, this “obvious” approach for simulating
a quantum computation will take not just exponential
time but exponential space as well, while it can be
shown that using a simple recursive formula one can 
calculate the final quantum state using polynomial 
space (in physics this is known as “Feynman path inte- 
grals”). So, the exponentially long vector description 
by itself does not imply that quantum computers are 
exponentially powerful. Indeed, we cannot prove that 
they are (i.e., as far as we know, every QNAND-CIRC 
program could be simulated by a NAND-CIRC pro- 
gram with polynomial overhead), but we do have 
some problems (integer factoring most prominently) 
for which they do provide exponential speedup over 
the currently best known classical (deterministic or 
probabilistic) algorithms. 
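
One way to see the polynomial-space claim is the following sketch of the recursive "sum over paths" formula. It is an illustration only, not an efficient simulator: the function name `amplitude` and the tuple encoding of gates, `("HAD", i)` and `("NAND", i, j, k)` with target qubit k, are assumptions made here. The recursion depth equals the number of gates and each level stores one basis string, so the space is polynomial even though the running time is exponential in the number of HAD gates.

```python
import math

def amplitude(gates, x, y):
    """Amplitude of basis state y after applying `gates` in order to basis
    state x. x and y are tuples of bits."""
    if not gates:
        return 1.0 if x == y else 0.0
    *earlier, last = gates
    if last[0] == "NAND":
        _, i, j, k = last
        z = list(y)
        z[k] ^= 1 - y[i] * y[j]          # the unique state mapped to y
        return amplitude(earlier, x, tuple(z))
    _, i = last                           # HAD on qubit i
    total = 0.0
    for bit in (0, 1):                    # the two states that can lead to y
        z = list(y)
        z[i] = bit
        sign = -1.0 if (bit == 1 and y[i] == 1) else 1.0
        total += sign / math.sqrt(2) * amplitude(earlier, x, tuple(z))
    return total

# Two Hadamards cancel: starting from |0>, the amplitude of |1> is zero.
cancel = amplitude([("HAD", 0), ("HAD", 0)], (0,), (1,))
# HAD on qubit 0, then write NAND(q0, q1) into qubit 2, starting from |000>.
half = amplitude([("HAD", 0), ("NAND", 0, 1, 2)], (0, 0, 0), (0, 0, 1))
print(cancel, half)
```

The probability of observing a given output bit is then the sum of squared amplitudes over all final basis states consistent with that bit, computed one state at a time.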


23.8.2 QNAND-CIRC programs (optional) 

Just like in the classical case, there is an equivalence between circuits
and straightline programs, and so we can define the programming
language QNAND that is the quantum analog of our NAND-CIRC
programming language. To do so, we only add a single operation:
HAD(foo) which applies the single-bit operation H to the variable foo.
We also use the following interpretation to make NAND reversible: foo
= NAND(bar,blah) means that we modify foo to be the XOR of its
original value and the NAND of bar and blah. (In other words, apply
the 8 by 8 unitary transformation $U_{NAND}$ defined above to the three
qubits corresponding to foo, bar and blah.) If foo is initialized to
zero then this makes no difference.

If P is a QNAND-CIRC program with n input variables, $\ell$
workspace variables, and m output variables, then running it on the
input $x \in \{0,1\}^n$ corresponds to setting up a system with $n + m + \ell$
qubits and performing the following process:


1. We initialize the input variables X[0] ... X[n − 1] to $x_0, \ldots, x_{n-1}$ and
all other variables to 0.


2. We execute the program line by line, applying the corresponding
physical operation $H$ or $U_{NAND}$ to the qubits that are referred to by
the line.


3. We measure the output variables Y[0], ..., Y[m − 1] and output
the result (if there is more than one output then we measure more
variables).
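
The three steps above can be sketched as a tiny state-vector simulator. This is a toy illustration under several assumptions: programs are given as lists of tuples `("HAD", i)` and `("NAND", i, j, k)` (meaning qubit k is XORed with the NAND of qubits i and j) rather than parsed from QNAND-CIRC source, the last qubit plays the role of the single output variable, and the names `run` and `prob_last_qubit_is` are hypothetical.

```python
import math
from collections import defaultdict

def run(program, x, num_qubits):
    """Return the final state as a dict mapping bit tuples to amplitudes."""
    # Step 1: inputs hold x, every other qubit starts at 0.
    state = {tuple(x) + (0,) * (num_qubits - len(x)): 1.0}
    # Step 2: apply the operations one by one.
    for op in program:
        new_state = defaultdict(float)
        for bits, amp in state.items():
            if op[0] == "NAND":
                _, i, j, k = op
                out = list(bits)
                out[k] ^= 1 - bits[i] * bits[j]
                new_state[tuple(out)] += amp
            else:  # ("HAD", i)
                _, i = op
                for v in (0, 1):
                    out = list(bits)
                    out[i] = v
                    sign = -1.0 if (v == 1 and bits[i] == 1) else 1.0
                    new_state[tuple(out)] += sign * amp / math.sqrt(2)
        state = dict(new_state)
    return state

def prob_last_qubit_is(state, value):
    """Step 3: probability that measuring the last qubit yields `value`."""
    return sum(abs(amp) ** 2 for bits, amp in state.items() if bits[-1] == value)

# A purely classical program: Y = NAND(X[0], X[1]) on input 11 outputs 0.
classical = prob_last_qubit_is(run([("NAND", 0, 1, 2)], (1, 1), 3), 0)
# Putting X[0] in superposition first makes the output a fair coin on input 01.
coin = prob_last_qubit_is(run([("HAD", 0), ("NAND", 0, 1, 2)], (0, 1), 3), 1)
print(classical, coin)
```

The dictionary can grow to $2^{n+m+\ell}$ entries, which is the exponential blow-up discussed in Remark 23.9.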


23.8.3 Uniform computation 
Just as in the classical case, we can define uniform computational
models. For example, we can define the QNAND-TM programming language
to be QNAND augmented with loops and arrays just like NAND-TM
is obtained from NAND. Using this we can define the class BQP
which is the uniform analog of $\mathbf{BQP}_{/poly}$. Just as in the classical setting
it holds that $\mathbf{BPP} \subseteq \mathbf{P}_{/poly}$, in the quantum setting it can be shown that
$\mathbf{BQP} \subseteq \mathbf{BQP}_{/poly}$. Just like the classical case, we can also use Quantum
Turing Machines instead of QNAND-TM to define BQP.

Yet another way to define BQP is the following: a function
$F : \{0,1\}^* \to \{0,1\}$ is in BQP if (1) $F \in \mathbf{BQP}_{/poly}$ and (2) moreover
for every n, the quantum circuit that verifies this can be generated
by a classical polynomial time NAND-TM program (or, equivalently,
a polynomial-time Turing machine).¹⁶ We use this definition here,
though an equivalent one can be made using QNAND-TM or quantum
Turing machines:

16 This is analogous to the alternative characterization
of P that appears in ??.


Definition 23.10 — The class BQP. Let $F : \{0,1\}^* \to \{0,1\}$. We say that
$F \in \mathbf{BQP}$ if there exists a polynomial time NAND-TM program P
such that for every n, $P(1^n)$ is the description of a quantum circuit
$C_n$ that computes the restriction of $F$ to $\{0,1\}^n$.


The relation between NP and BQP is not known (see also
Remark 23.4). It is widely believed that $\mathbf{NP} \not\subseteq \mathbf{BQP}$, but there is no
consensus whether or not $\mathbf{BQP} \subseteq \mathbf{NP}$. It is quite possible that these
two classes are incomparable, in the sense that $\mathbf{NP} \not\subseteq \mathbf{BQP}$ (and in
particular no NP-complete function belongs to BQP) but also $\mathbf{BQP} \not\subseteq \mathbf{NP}$
(and there are some interesting candidates for such problems).

It can be shown that QNANDEVAL (evaluating a quantum circuit 
on an input) is computable by a polynomial size QNAND-CIRC pro- 
gram, and moreover this program can even be generated uniformly 
and hence QNANDEVAL is in BQP. This allows us to “port” many 
of the results of classical computational complexity into the quantum 
realm as well. 


23.9 PHYSICALLY REALIZING QUANTUM COMPUTATION 


To realize quantum computation one needs to create a system with n 
independent binary states (i.e., “qubits”), and be able to manipulate 
small subsets of two or three of these qubits to change their state. 
While by the way we defined operations above it might seem that
one needs to be able to perform arbitrary unitary operations on these
two or three qubits, it turns out that there are several choices for
universal sets: a small constant number of gates that generate all others.
The biggest challenge is how to keep the system from being measured and
collapsing to a single classical combination of states. The length of
time for which the system can avoid such a collapse is sometimes
known as the coherence time of the system. The threshold theorem says
that there is some absolute constant level of errors $\tau$ so that if errors
are created at every gate at rate smaller than $\tau$ then we can recover
from those and perform arbitrarily long computations. (Of course there
are different ways to model the errors and so there are actually several
threshold theorems corresponding to various noise models.)
There have been several proposals to build quantum computers:


• Superconducting quantum computers use superconducting electric
circuits to do quantum computation. These are currently the
devices with the largest number of fully controllable qubits.


• At Harvard, Lukin’s group is using cold atoms to implement quantum
computers.


• Trapped ion quantum computers use the states of an ion to simulate
a qubit. People have made some recent advances on these
computers too. For example, an ion-trap computer was used to
implement Shor’s algorithm to factor 15. (It turns out that 15 = 3 × 5 :) )


• Topological quantum computers use a different technology, which
is more stable by design but arguably harder to manipulate to create
quantum computers.


These approaches are not mutually exclusive and it could be that
ultimately quantum computers are built by combining all of them
together. At the moment, we have devices with about 100 qubits,
and about 1% error per gate. Such restricted machines are sometimes
called “Noisy Intermediate-Scale Quantum Computers” or “NISQ”.
See this article by John Preskill for some of the progress and
applications of such more restricted devices. If the number of qubits is
increased and the error is decreased by one or two orders of magnitude,
we could start seeing more applications.


23.10 SHOR’S ALGORITHM: HEARING THE SHAPE OF PRIME FACTORS


Bell’s Inequality is a powerful demonstration that there is something
very strange going on with quantum mechanics. But could
this “strangeness” be of any use to solve computational problems not
directly related to quantum systems? A priori, one could guess the
answer is no. In 1994 Peter Shor showed that one would be wrong:


Theorem 23.12 — Shor’s Algorithm. There is a polynomial-time quan- 
tum algorithm that on input an integer M (represented in base 
two), outputs the prime factorization of M. 


Another way to state Theorem 23.12 is that if we define
FACTORING : $\{0,1\}^* \rightarrow \{0,1\}$ to be the function that on input a
pair of numbers $(M, X)$ outputs 1 if and only if $M$ has a factor $P$ such
that $2 \leq P \leq X$, then FACTORING is in BQP. This is an exponential
improvement over the best known classical algorithms, which take
roughly $2^{\tilde{O}(n^{1/3})}$ time, where the $\tilde{O}$ notation hides factors that are
polylogarithmic in $n$. While we will not prove Theorem 23.12 in this
chapter, we will sketch some of the ideas behind the proof.


Figure 23.3: Superconducting quantum computer
prototype at Google. Image credit: Google / MIT
Technology Review.
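
To see why the decision function FACTORING captures the task of finding the full factorization, here is a minimal Python sketch. The names `factoring_decision`, `smallest_factor` and `factorize` are our own illustrative choices, and the brute-force trial division is only a classical stand-in for whatever procedure decides FACTORING; it takes time exponential in the number of bits of M.

```python
def factoring_decision(M, X):
    """FACTORING(M, X): True iff M has a factor P with 2 <= P <= X.
    Brute-force stand-in for a FACTORING oracle (exponential in the bit length of M)."""
    return any(M % P == 0 for P in range(2, X + 1))


def smallest_factor(M, decide=factoring_decision):
    """Binary search for the smallest factor P >= 2 of M, using only the decision oracle."""
    lo, hi = 2, M  # M divides itself, so decide(M, M) is always True for M >= 2
    while lo < hi:
        mid = (lo + hi) // 2
        if decide(M, mid):
            hi = mid       # some factor lies in [2, mid]
        else:
            lo = mid + 1   # no factor in [2, mid]
    return lo


def factorize(M, decide=factoring_decision):
    """Peel off the smallest factor (necessarily prime) until nothing is left."""
    factors = []
    while M > 1:
        P = smallest_factor(M, decide)
        factors.append(P)
        M //= P
    return factors


print(factorize(15))  # [3, 5]
```

Each call to `smallest_factor` makes about log M oracle queries, so a polynomial-time procedure for FACTORING would yield a polynomial-time procedure for the full factorization.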


23.10.1 Period finding 

At the heart of Shor’s Theorem is an efficient quantum algorithm for
finding periods of a given function. For example, a function $f : \mathbb{R} \rightarrow \mathbb{R}$
is periodic if there is some $h > 0$ such that $f(x+h) = f(x)$ for every $x$
(e.g., see Fig. 23.4).

Musical notes yield one type of periodic function. When you pull 
on a string on a musical instrument, it vibrates in a repeating pattern. 
Hence, if we plot the speed of the string (and so also the speed of 
the air around it) as a function of time, it will correspond to some 
periodic function. The length of the period is known as the wavelength
of the note. The frequency is the number of times the function repeats 
itself within a unit of time. For example, the “Middle C” note has 
a frequency of 261.63 Hertz, which means its period is 1/(261.63) 
seconds. 

If we play a chord by playing several notes at once, we get a more 
complex periodic function obtained by combining the functions of 
the individual notes (see Fig. 23.5). The human ear contains many 
small hairs, each of which is sensitive to a narrow band of frequencies. 
Hence when we hear the sound corresponding to a chord, the hairs in 
our ears actually separate it out to the components corresponding to 
each frequency. 

It turns out that (essentially) every periodic function $f : \mathbb{R} \rightarrow \mathbb{R}$
can be decomposed into a sum of simple wave functions (namely
functions of the form $x \mapsto \sin(\alpha x)$ or $x \mapsto \cos(\beta x)$). This is known as
the Fourier Transform (see Fig. 23.6). The Fourier transform makes it
easy to compute the period of a given function: it will simply be the
least common multiple of the periods of the constituent waves.
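
As a small illustration of this last point, the following Python sketch computes a discrete Fourier transform of a sampled signal made of two waves and reads the period off the nonzero frequencies. It is a toy under stated assumptions: the signal, the window length `N = 48`, and the helper name `dft` are our own choices, and both frequencies divide `N` so that each wave's period is an integer number of samples.

```python
import cmath
import math
from functools import reduce


def dft(samples):
    """Plain O(N^2) discrete Fourier transform, normalized by 1/N."""
    N = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / N) for t in range(N)) / N
            for k in range(N)]


N = 48
# Sum of two waves that complete 4 and 6 cycles within the window of N samples.
signal = [math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.cos(2 * math.pi * 6 * t / N)
          for t in range(N)]

coeffs = dft(signal)
# Frequencies (cycles per window) with non-negligible weight; we only scan 1..N/2-1
# because the coefficients of a real signal are mirrored in the upper half.
frequencies = [k for k in range(1, N // 2) if abs(coeffs[k]) > 1e-9]
periods = [N // k for k in frequencies]  # exact here because 4 and 6 divide 48
period = reduce(lambda a, b: a * b // math.gcd(a, b), periods)  # least common multiple

print(frequencies, periods, period)  # [4, 6] [12, 8] 24
```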


23.10.2 Shor’s Algorithm: A bird’s eye view 

On input an integer $M$, Shor’s algorithm outputs the prime factorization
of $M$ in time that is polynomial in $\log M$. The main steps in the
algorithm are the following:


Step 1: Reduce to period finding. The first step in the algorithm is to
pick a random $A \in \{0, 1, \ldots, M-1\}$ and define the function
$F_A : \{0,1\}^m \rightarrow \{0,1\}^m$ as $F_A(x) = A^x \pmod{M}$ where we identify the
string $x \in \{0,1\}^m$ with an integer using the binary representation,
and similarly represent the integer $A^x \pmod{M}$ as a string. (We will
choose $m$ to be some polynomial in $\log M$ and so in particular $\{0,1\}^m$ is a
large enough set to represent all the numbers in $\{0, 1, \ldots, M-1\}$.)
Figure 23.4: Top: A periodic function. Bottom: An
a-periodic function.


Figure 23.5: Left (time domain): The air pressure when playing a
“C Major” chord as a function of time. Right (frequency domain): The
coefficients of the Fourier transform of the same function.
We can see that it is the sum of three frequencies
corresponding to the C, E and G notes (261.63, 329.63
and 392 Hertz respectively). Credit: Bjarke Monsted’s
Quora answer.


Figure 23.6: If f is a periodic function then when we
represent it in the Fourier transform, we expect the
coefficients corresponding to wavelengths that do
not evenly divide the period to be very small, as they
would tend to “cancel out”.
Some not-too-hard (though somewhat technical) calculations show
that: (1) The function $F_A$ is periodic (i.e., there is some integer $p_A$ such
that $F_A(x + p_A) = F_A(x)$ for almost¹⁷ every $x$) and more importantly
(2) If we can recover the period $p_A$ of $F_A$ for several randomly chosen
$A$’s, then we can recover the factorization of $M$. Hence, factoring $M$
reduces to finding out the period of the function $F_A$. Exercise 23.2
asks you to work out this for the related task of computing the discrete
logarithm (which underlies the security of the Diffie-Hellman key
exchange and elliptic curve cryptography).
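
To make claim (2) concrete, here is a hedged Python sketch of one standard classical way to turn periods into a factor; the text does not spell out which number-theoretic argument it has in mind, so this should be read as an illustration rather than as the book's proof. The function names are hypothetical, and the brute-force `period` routine stands in for the quantum period-finding step (it takes exponential time). We assume M is odd, composite and not a prime power.

```python
import math
import random


def period(A, M):
    """Smallest p >= 1 with A^p = 1 (mod M); requires gcd(A, M) = 1.
    Brute-force stand-in for the quantum period-finding step."""
    p, value = 1, A % M
    while value != 1:
        value = (value * A) % M
        p += 1
    return p


def factor_from_periods(M, rng=random.Random(0)):
    """Return a nontrivial factor of M (assumed odd, composite, not a prime power)."""
    while True:
        A = rng.randrange(2, M)
        g = math.gcd(A, M)
        if g > 1:
            return g               # lucky: A already shares a factor with M
        p = period(A, M)
        if p % 2 == 1:
            continue               # need an even period; try another A
        r = pow(A, p // 2, M)      # r^2 = 1 (mod M) and r != 1 since p is minimal
        if r == M - 1:
            continue               # r = -1 (mod M) gives no information; try another A
        # Now M divides (r - 1)(r + 1) but neither factor alone,
        # so gcd(r - 1, M) is a nontrivial divisor of M.
        return math.gcd(r - 1, M)


print(factor_from_periods(15) in (3, 5))  # True
```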


Step 2: Period finding via the Quantum Fourier Transform. Using a simple
trick known as “repeated squaring”, it is possible to compute the
map $x \mapsto F_A(x)$ in time polynomial in $m$, which means we can also
compute this map using a polynomial number of NAND gates, and so
in particular we can generate in polynomial quantum time a quantum
state $\rho$ that is (up to normalization) equal to

$$\sum_{x \in \{0,1\}^m} |x\rangle |F_A(x)\rangle \;.$$

In particular, if we were to measure the state $\rho$, we would get a
random pair of the form $(x, y)$ where $y = F_A(x)$. So far, this is not at all
impressive. After all, we did not need the power of quantum computing
to generate such pairs: we could simply generate a random $x$ and
then compute $F_A(x)$.
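
The “repeated squaring” trick mentioned in Step 2 can be sketched as follows. This is a minimal illustration (the name `power_mod` is ours; Python's built-in `pow(A, x, M)` computes the same value): it uses a number of modular multiplications proportional to the number of bits of $x$, rather than to $x$ itself.

```python
def power_mod(A, x, M):
    """Compute A^x mod M with O(log x) modular multiplications."""
    result = 1 % M
    square = A % M           # invariant: square = A^(2^i) mod M at iteration i
    while x > 0:
        if x & 1:            # the i-th bit of x is set: multiply in A^(2^i)
            result = (result * square) % M
        square = (square * square) % M
        x >>= 1
    return result


print(power_mod(7, 4, 15))  # 1
```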

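Another way to describe the state $\rho$ is that the coefficient of $|x\rangle|y\rangle$ in
$\rho$ is proportional to $f_{A,y}(x)$ where $f_{A,y} : \{0,1\}^m \rightarrow \mathbb{R}$ is the function
such that

$$f_{A,y}(x) = \begin{cases} 1 & y = A^x \pmod{M} \\ 0 & \text{otherwise} \end{cases} \;.$$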


The magic of Shor’s algorithm comes from a procedure known
as the Quantum Fourier Transform. It allows us to change the state $\rho$ into
the state $\hat{\rho}$ where the coefficient of $|x\rangle|y\rangle$ is now proportional to the
$x$-th Fourier coefficient of $f_{A,y}$. In other words, if we measure the state
$\hat{\rho}$, we will obtain a pair $(x, y)$ such that the probability of choosing $x$
is proportional to the square of the weight of the frequency $x$ in the
representation of the function $f_{A,y}$. Since for every $y$, the function
$f_{A,y}$ has the period $p_A$, it can be shown that the frequency $x$ will be
(almost¹⁸) a multiple of $p_A$. If we make several such samples $y_0, \ldots, y_k$
and obtain the frequencies $x_0, \ldots, x_k$, then the true period $p_A$ divides
all of them, and it can be shown that it is going to be in fact the greatest
common divisor (g.c.d.) of all these frequencies: a value which can be
computed in polynomial time.


17 We'll ignore this “almost” qualifier in the discussion
below. It causes some annoying, yet ultimately
manageable, technical issues in the full-fledged
algorithm.


18 The “almost” qualifier again appears because
the original function was only “almost” periodic,
but it turns out this can be handled by using an
“approximate greatest common divisor” algorithm
instead of a standard g.c.d. algorithm. The latter can be
obtained using a tool known as the continued fraction
representation of a number.


As mentioned above, we can recover the factorization of $M$ from
the periods of $F_{A_0}, \ldots, F_{A_t}$ for some randomly chosen $A_0, \ldots, A_t$ in
$\{0, \ldots, M-1\}$ and $t$ which is polynomial in $\log M$.

The resulting algorithm can be described at a high (and somewhat
inaccurate) level as follows:


Shor’s Algorithm: (sketch)
Input: Number $M \in \mathbb{N}$.
Output: Prime factorization of $M$.


Operations:


1. Repeat the following $k = poly(\log M)$ number of times:


a. Choose $A \in \{0, \ldots, M-1\}$ at random, and let $f_A : \mathbb{Z}_M \rightarrow \mathbb{Z}_M$ be
the map $x \mapsto A^x \bmod M$.

b. For $t = poly(\log M)$, repeat $t$ times the following step: use the Quantum
Fourier Transform to create a quantum state $|\psi\rangle$ over $poly(\log M)$
qubits, such that if we measure $|\psi\rangle$ we obtain a pair of strings
$(j, y)$ with probability proportional to the square of the coefficient
corresponding to the wave function $x \mapsto \cos(x\pi j/M)$ or
$x \mapsto \sin(x\pi j/M)$ in the Fourier transform of the function
$f_{A,y} : \mathbb{Z}_M \rightarrow \{0,1\}$ defined as $f_{A,y}(x) = 1$ iff $f_A(x) = y$.

c. If $j_1, \ldots, j_t$ are the coefficients we obtained in the previous step,
then the least common multiple of $M/j_1, \ldots, M/j_t$ is likely to be
the period of the function $f_A$.


2. If we let $A_0, \ldots, A_{k-1}$ and $p_0, \ldots, p_{k-1}$ be the numbers we chose in
the previous step and the corresponding periods of the functions
$f_{A_0}, \ldots, f_{A_{k-1}}$ then we can use classical results in number theory to
obtain from these a non-trivial prime factor $Q$ of $M$ (if such exists).
We can now run the algorithm again with the (smaller) input $M/Q$
to obtain all other factors.
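
The following Python sketch simulates steps 1(b) and 1(c) classically in an idealized toy setting. The assumptions are ours and are stronger than what the real algorithm needs: we take the Fourier transform over $\mathbb{Z}_N$ for a transform size $N$ that the period divides exactly (here $N = 64$, $M = 15$, $A = 7$, period 4), so the “almost” issues disappear, and we compute the measurement distribution by brute force instead of with a quantum circuit. All function names are hypothetical. In this setting every observed frequency $j$ satisfies $j/N = k/p$ for some integer $k$, so reducing $j/N$ to lowest terms and taking the least common multiple of the denominators is likely to give the period $p$.

```python
import cmath
import math
import random
from fractions import Fraction


def fourier_weights(indicator):
    """Squared magnitudes of the Fourier coefficients over Z_N (brute force)."""
    N = len(indicator)
    weights = []
    for j in range(N):
        c = sum(indicator[x] * cmath.exp(2j * math.pi * j * x / N) for x in range(N))
        weights.append(round(abs(c) ** 2, 9))  # rounding removes floating-point noise
    return weights


def period_from_frequencies(js, N):
    """lcm of the denominators of j/N in lowest terms, over all sampled j."""
    period = 1
    for j in js:
        d = Fraction(j, N).denominator
        period = period * d // math.gcd(period, d)
    return period


def recover_period(A, M, N, samples=8, rng=random.Random(1)):
    """Sample frequencies with probability proportional to their weight, then combine."""
    y = 1  # the value A^0 mod M; any value in the image of x -> A^x mod M would do
    indicator = [1 if pow(A, x, M) == y else 0 for x in range(N)]
    weights = fourier_weights(indicator)
    js = rng.choices(range(N), weights=weights, k=samples)  # simulated measurements
    return period_from_frequencies(js, N)


print(period_from_frequencies([16, 32, 48, 0], 64))  # 4
```

With few samples the procedure can return a proper divisor of the period (for example, if every sample happens to be 0 or 32), which is one reason the sketch above says “likely”.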


Reducing factoring to order finding is cumbersome, but can be
done in polynomial time using a classical computer. The key quantum
ingredient in Shor’s algorithm is the Quantum Fourier Transform.
23.11 QUANTUM FOURIER TRANSFORM (ADVANCED, OPTIONAL)


The above description of Shor’s algorithm skipped over the implemen- 
tation of the main quantum ingredient: the Quantum Fourier Transform 
algorithm. In this section we discuss the ideas behind this algorithm. 
We will be rather brief and imprecise. Remark 23.3 and Section 23.13 
contain references to sources of more information about this topic. 

To understand the Quantum Fourier Transform, we need to better 
understand the Fourier Transform itself. In particular, we will need 
to understand how it applies not just to functions whose input is a 
real number but to functions whose domain can be any arbitrary 
commutative group. Therefore we now take a short detour to (very 
basic) group theory, and define the notion of periodic functions over 
groups. 


(R) 


The Fourier transform is a deep and vast topic, which we will
barely touch upon here. Over the real numbers, the Fourier transform
of a function $f$ is obtained by expressing $f$ in the form $\sum \hat{f}(\alpha)\chi_\alpha$
where the $\chi_\alpha$’s are “wave functions” (e.g. sines and cosines). However,
it turns out that the same notion exists for every Abelian group $G$.
Specifically, for every such group $G$, if $f$ is a function mapping $G$ to $\mathbb{C}$,
then we can write $f$ as

$$f = \sum_{g \in G} \hat{f}(g)\chi_g \;, \qquad (23.2)$$

where the $\chi_g$’s are functions mapping $G$ to $\mathbb{C}$ that are analogs of
the “wave functions” for the group $G$ and for every $g \in G$, $\hat{f}(g)$ is a
complex number known as the Fourier coefficient of $f$ corresponding to
$g$.¹⁹ The representation (23.2) is known as the Fourier expansion or
Fourier transform of $f$, the numbers $(\hat{f}(g))_{g \in G}$ are known as the Fourier
coefficients of $f$ and the functions $(\chi_g)_{g \in G}$ are known as the Fourier
characters. The central property of the Fourier characters is that they
are homomorphisms of the group into the complex numbers, in the
sense that for every $x, x' \in G$, $\chi_g(x \star x') = \chi_g(x)\chi_g(x')$, where $\star$ is
the group operation. One corollary of this property is that if $\chi_g(h) = 1$
then $\chi_g$ is $h$ periodic in the sense that $\chi_g(x \star h) = \chi_g(x)$ for every $x$.
It turns out that if $f$ is periodic with minimal period $h$, then the only
Fourier characters that have non zero coefficients in the expression
(23.2) are those that are $h$ periodic as well. This can be used to recover
the period of $f$ from its Fourier expansion.
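
These properties can be checked numerically for a concrete group. The Python sketch below takes $G = \mathbb{Z}_L$ (addition modulo $L$) with the characters $\chi_y(x) = e^{2\pi i xy/L}$ that appear in Section 23.11.3; the choice $L = 12$, the 4-periodic test function, and the normalization of the coefficients (picked so that the expansion (23.2) holds exactly) are our own assumptions.

```python
import cmath
import math


def character(y, L):
    """The character chi_y of Z_L: x -> e^(2*pi*i*x*y/L)."""
    return lambda x: cmath.exp(2j * math.pi * x * y / L)


def fourier_coefficient(f, y, L):
    """Coefficient of chi_y, normalized so that f = sum_y coefficient(y) * chi_y."""
    chi = character(y, L)
    return sum(f(x) * chi(x).conjugate() for x in range(L)) / L


L, h = 12, 4
f = lambda x: (x % h) ** 2  # an h-periodic function on Z_12 (4 divides 12)

# Homomorphism property: chi(x + z) = chi(x) * chi(z), checked for chi_5.
chi = character(5, L)
is_homomorphism = all(abs(chi((x + z) % L) - chi(x) * chi(z)) < 1e-9
                      for x in range(L) for z in range(L))

# Characters with a nonzero coefficient versus characters that are h-periodic.
support = [y for y in range(L) if abs(fourier_coefficient(f, y, L)) > 1e-9]
h_periodic = [y for y in range(L) if abs(character(y, L)(h) - 1) < 1e-9]

print(is_homomorphism, support, h_periodic)  # True [0, 3, 6, 9] [0, 3, 6, 9]
```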


23.11.1 Quantum Fourier Transform over the Boolean Cube: Simon’s 
Algorithm 

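We now describe the simplest setting of the Quantum Fourier Transform:
the group $\{0,1\}^n$ with the XOR operation, which we'll denote
by $(\{0,1\}^n, \oplus)$. It can be shown that the Fourier transform over
$(\{0,1\}^n, \oplus)$ corresponds to expressing $f : \{0,1\}^n \rightarrow \mathbb{C}$ (up to a
normalizing factor) as

$$f = \sum_{y \in \{0,1\}^n} \hat{f}(y)\chi_y$$

where $\chi_y : \{0,1\}^n \rightarrow \mathbb{C}$ is defined as $\chi_y(x) = (-1)^{\sum_i y_i x_i}$ and
$\hat{f}(y) = \tfrac{1}{\sqrt{2^n}} \sum_{x \in \{0,1\}^n} f(x)(-1)^{\sum_i y_i x_i}$.

The Quantum Fourier Transform over $(\{0,1\}^n, \oplus)$ is actually quite
simple: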


Theorem 23.15 — QFT Over the Boolean Cube. Let $\rho = \sum_{x \in \{0,1\}^n} f(x)|x\rangle$
be a quantum state where $f : \{0,1\}^n \rightarrow \mathbb{C}$ is some function satisfying
$\sum_{x \in \{0,1\}^n} |f(x)|^2 = 1$. Then we can use $n$ gates to transform $\rho$ to
the state

$$\sum_{y \in \{0,1\}^n} \hat{f}(y)|y\rangle$$

where $f = \sum_y \hat{f}(y)\chi_y$ and $\chi_y : \{0,1\}^n \rightarrow \mathbb{C}$ is the function
$\chi_y(x) = (-1)^{\sum_i x_i y_i}$.


19 The equation (23.2) means that if we think of $f$ as
a $|G|$ dimensional vector over the complex numbers,
then we can write this vector as a sum (with certain
coefficients) of the vectors $\{\chi_g\}_{g \in G}$.


Proof Idea: 

The idea behind the proof is that the Hadamard operation corresponds
to the Fourier transform over the group $\{0,1\}^n$ (with the XOR
operation). To show this, we just need to do the calculations.

* 


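Proof of Theorem 23.15. We can express the Hadamard operation HAD
as follows:

$$HAD|a\rangle = \tfrac{1}{\sqrt{2}}\left(|0\rangle + (-1)^a|1\rangle\right) \;.$$

We are given the state

$$\rho = \sum_{x \in \{0,1\}^n} f(x)|x\rangle \;.$$

Now suppose that we apply the HAD operation to each of the $n$
qubits. We can see that we get the state

$$2^{-n/2} \sum_{x \in \{0,1\}^n} f(x) \prod_{i=0}^{n-1} \left(|0\rangle + (-1)^{x_i}|1\rangle\right) \;.$$

We can now use the distributive law and open up a term of the
form

$$f(x)\left(|0\rangle + (-1)^{x_0}|1\rangle\right) \cdots \left(|0\rangle + (-1)^{x_{n-1}}|1\rangle\right)$$

to the following sum over $2^n$ terms:²⁰

$$f(x) \sum_{y \in \{0,1\}^n} (-1)^{\sum_i y_i x_i}|y\rangle \;.$$

But by changing the order of summations, we see that the final state
is

$$\sum_{y \in \{0,1\}^n} 2^{-n/2} \left( \sum_{x \in \{0,1\}^n} f(x)(-1)^{\sum_i x_i y_i} \right) |y\rangle$$

which exactly corresponds to the state $\hat{\rho} = \sum_{y \in \{0,1\}^n} \hat{f}(y)|y\rangle$.


20 If you find this confusing, try to work out why
$(|0\rangle + (-1)^{x_0}|1\rangle)(|0\rangle + (-1)^{x_1}|1\rangle)(|0\rangle + (-1)^{x_2}|1\rangle)$ is the
same as the sum over $2^3$ terms $|000\rangle + (-1)^{x_2}|001\rangle + \cdots + (-1)^{x_0+x_1+x_2}|111\rangle$.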
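
The calculation in the proof can be checked numerically. The sketch below is a classical simulation under our own conventions (a state is a Python list of $2^n$ amplitudes, and bit $i$ of the index $x$ plays the role of $x_i$); it applies HAD to each qubit in turn and compares the result with the formula $\hat{f}(y) = 2^{-n/2}\sum_x f(x)(-1)^{\sum_i x_i y_i}$ evaluated directly. The function names are hypothetical.

```python
import math


def hadamard_all(state):
    """Apply HAD to each of the n qubits of a state vector of length 2^n."""
    state = list(state)
    n = len(state).bit_length() - 1
    for i in range(n):
        bit = 1 << i
        for x in range(len(state)):
            if not x & bit:
                # a, b are the amplitudes of the two basis states differing in qubit i
                a, b = state[x], state[x | bit]
                state[x] = (a + b) / math.sqrt(2)
                state[x | bit] = (a - b) / math.sqrt(2)
    return state


def fourier_direct(f):
    """f_hat(y) = 2^(-n/2) * sum_x f(x) * (-1)^(x . y), computed term by term."""
    N = len(f)
    return [sum(f[x] * (-1) ** bin(x & y).count("1") for x in range(N)) / math.sqrt(N)
            for y in range(N)]


f = [0.0, 1.0, 0.0, 0.0]   # the basis state |x> with x_0 = 1, x_1 = 0
print(hadamard_all(f))     # approximately [0.5, -0.5, 0.5, -0.5]
```

The simulation touches all $2^n$ amplitudes, so it takes exponential time; the point of Theorem 23.15 is that a quantum computer achieves the same transformation with only $n$ gates.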


23.11.2 From Fourier to Period finding: Simon’s Algorithm (advanced, 
optional) 

Using Theorem 23.15 it is not hard to get an algorithm that can recover
a string $h^* \in \{0,1\}^n$ given a circuit that computes a function
$F : \{0,1\}^n \rightarrow \{0,1\}^*$ that is $h^*$ periodic in the sense that $F(x) = F(x')$ for
distinct $x, x'$ if and only if $x' = x \oplus h^*$. The key observation is that if
we compute the state $\sum_{x \in \{0,1\}^n} |x\rangle|F(x)\rangle$, and perform the Quantum
Fourier transform on the first $n$ qubits, then we would get a state such
that the only basis elements with nonzero coefficients would be of the
form $|y\rangle$ where

$$\sum_i y_i h^*_i = 0 \pmod{2} \qquad (23.3)$$

So, by measuring the state, we can obtain a sample of a random $y$
satisfying (23.3). But since (23.3) is a linear equation modulo 2 about
the unknown $n$ variables $h^*_0, \ldots, h^*_{n-1}$, if we repeat this procedure
to get $n$ such equations, we will have at least as many equations as
variables and (it can be shown that) this will suffice to recover $h^*$.

This result is known as Simon’s Algorithm, and it preceded and 
inspired Shor’s algorithm. 
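
Here is a small classical simulation of the idea, written as a sketch under our own assumptions: $n = 4$, the hidden string is `h = 0b1010`, the function `F(x) = min(x, x ^ h)` is one convenient $h$-periodic function, and strings are encoded as integers whose bits are the coordinates. The simulation computes the amplitudes by brute force (exponential time), so it only illustrates which $y$'s can be observed and how they determine $h^*$; it is not an efficient algorithm, and the function names are hypothetical.

```python
import random


def dot(a, b):
    """Inner product modulo 2 of two bit strings encoded as integers."""
    return bin(a & b).count("1") % 2


def simon_support(F, n):
    """Basis states y that have nonzero amplitude after we measure the register
    holding F(x) and apply HAD to each of the n input qubits."""
    N = 1 << n
    v = F(0)  # pretend the measured output value was F(0)
    collapsed = [1.0 if F(x) == v else 0.0 for x in range(N)]  # unnormalized state
    amps = [sum(collapsed[x] * (-1) ** dot(x, y) for x in range(N)) for y in range(N)]
    return [y for y in range(N) if abs(amps[y]) > 1e-9]


def candidates(ys, n):
    """All nonzero strings h consistent with the equations y . h = 0 (mod 2)."""
    return [h for h in range(1, 1 << n) if all(dot(y, h) == 0 for y in ys)]


n, h = 4, 0b1010
F = lambda x: min(x, x ^ h)   # F(x) = F(x') for distinct x, x' iff x' = x XOR h

support = simon_support(F, n)
samples = random.Random(0).choices(support, k=n)  # simulated measurement outcomes

print(all(dot(y, h) == 0 for y in support))  # True
print(candidates(support, n) == [h])         # True
```

With only a few random samples `candidates(samples, n)` always contains `h` but may also contain other strings; once the sampled equations have rank $n - 1$, `h` is the only candidate left.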


23.11.3 From Simon to Shor (advanced, optional) 

Theorem 23.15 seemed to really use the special bit-wise structure of
the group $\{0,1\}^n$, and so one could wonder if it can be extended to
other groups. However, it turns out that we can in fact achieve such a
generalization.

The key step in Shor’s algorithm is to implement the Fourier transform
for the group $\mathbb{Z}_L$ which is the set of numbers $\{0, \ldots, L-1\}$ with
the operation being addition modulo $L$. In this case it turns out that
the Fourier characters are the functions $\chi_y(x) = \omega^{yx}$ where $\omega = e^{2\pi i/L}$
($i$ here denotes the complex number $\sqrt{-1}$). The $y$-th Fourier coefficient
of a function $f : \mathbb{Z}_L \rightarrow \mathbb{C}$ is

$$\hat{f}(y) = \tfrac{1}{\sqrt{L}} \sum_{x \in \mathbb{Z}_L} f(x)\omega^{xy} \;. \qquad (23.4)$$

The key to implementing the Quantum Fourier Transform for such
groups is to use the same recursive equations that enable the classical
Fast Fourier Transform (FFT) algorithm. Specifically, consider the case
that $L = 2^\ell$. We can separate the sum over $x$ in (23.4) to the terms
corresponding to even $x$’s (of the form $x = 2z$) and odd $x$’s (of the
form $x = 2z + 1$) to obtain

$$\hat{f}(y) = \tfrac{1}{\sqrt{L}} \sum_{z \in \mathbb{Z}_{L/2}} f(2z)(\omega^2)^{yz} + \tfrac{\omega^y}{\sqrt{L}} \sum_{z \in \mathbb{Z}_{L/2}} f(2z+1)(\omega^2)^{yz} \qquad (23.5)$$

which reduces computing the Fourier transform of $f$ over the group
$\mathbb{Z}_{2^\ell}$ to computing the Fourier transform of the functions $f_{even}$ and
$f_{odd}$ (corresponding to applying $f$ to only the even and odd $x$’s
respectively) which have $2^{\ell-1}$ inputs that we can identify with the
group $\mathbb{Z}_{2^{\ell-1}} = \mathbb{Z}_{L/2}$.
Specifically, the Fourier characters of the group $\mathbb{Z}_{L/2}$ are the
functions $\chi_y(x) = e^{2\pi i yx/(L/2)} = (\omega^2)^{yx}$ for every $x, y \in \mathbb{Z}_{L/2}$. Moreover,
since $\omega^L = 1$, $(\omega^2)^y = (\omega^2)^{y \bmod L/2}$ for every $y \in \mathbb{N}$. Thus (23.5)
translates into

$$\hat{f}(y) \propto \hat{f}_{even}(y \bmod L/2) + \omega^y \hat{f}_{odd}(y \bmod L/2) \;.$$


This observation is usually used to obtain a fast (e.g. $O(L \log L)$ time)
algorithm to compute the Fourier transform in a classical setting, but it can
be used to obtain a quantum circuit of $poly(\log L)$ gates to transform a
state of the form $\sum_{x \in \mathbb{Z}_L} f(x)|x\rangle$ to a state of the form $\sum_{y \in \mathbb{Z}_L} \hat{f}(y)|y\rangle$.
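
The classical side of this recursion can be sketched in a few lines of Python. We assume the normalization $\hat{f}(y) = \tfrac{1}{\sqrt{L}}\sum_x f(x)\omega^{xy}$ with $\omega = e^{2\pi i/L}$ and that $L$ is a power of two; under that assumption the constant relating $\hat{f}$ to $\hat{f}_{even}$ and $\hat{f}_{odd}$ works out to $1/\sqrt{2}$. The function names `dft` and `fft` are ours, and `dft` is only there as a slow reference to compare against.

```python
import cmath
import math


def dft(f):
    """Direct O(L^2) evaluation of f_hat(y) = L^(-1/2) * sum_x f(x) * w^(x*y)."""
    L = len(f)
    w = cmath.exp(2j * math.pi / L)
    return [sum(f[x] * w ** (x * y) for x in range(L)) / math.sqrt(L) for y in range(L)]


def fft(f):
    """Recursive O(L log L) evaluation for L a power of two, using
    f_hat(y) = (f_hat_even(y mod L/2) + w^y * f_hat_odd(y mod L/2)) / sqrt(2)."""
    L = len(f)
    if L == 1:
        return [complex(f[0])]
    even, odd = fft(f[0::2]), fft(f[1::2])  # transforms over Z_{L/2}
    w = cmath.exp(2j * math.pi / L)
    half = L // 2
    return [(even[y % half] + w ** y * odd[y % half]) / math.sqrt(2) for y in range(L)]


f = [1, 2, 0, -1, 0.5, 3, -2, 1]
print(max(abs(a - b) for a, b in zip(fft(f), dft(f))) < 1e-9)  # True
```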

The case that $L$ is not an exact power of two causes some complications
in both the classical case of the Fast Fourier Transform and the
quantum setting of Shor’s algorithm. However, it is possible to handle
these. The idea is that we can embed $\mathbb{Z}_L$ in the group $\mathbb{Z}_{A \cdot L}$ for any
integer $A$, and we can find an integer $A$ such that $A \cdot L$ will be close
enough to a power of 2 (i.e., a number of the form $2^m$ for some $m$), so
that if we do the Fourier transform over the group $\mathbb{Z}_{2^m}$ then we will
not introduce too many errors.


23.12 EXERCISES 


Exercise 23.1 — Quantum and classical complexity class relations. Prove the
following relations between quantum complexity classes and classical
ones:


1. $\mathbf{P_{/poly}} \subseteq \mathbf{BQP_{/poly}}$.²¹

2. $\mathbf{P} \subseteq \mathbf{BQP}$.²²

3. $\mathbf{BPP} \subseteq \mathbf{BQP}$.²³

4. $\mathbf{BQP} \subseteq \mathbf{EXP}$.²⁴

5. If $SAT \in \mathbf{BQP}$ then $\mathbf{NP} \subseteq \mathbf{BQP}$.²⁵


Exercise 23.2 — Discrete logarithm from order finding. Show a probabilistic
polynomial time classical algorithm that given an Abelian finite group
$G$ (in the form of an algorithm that computes the group operation),
a generator $g$ for the group, and an element $h \in G$, as well as access to a
black box that on input $f \in G$ outputs the order of $f$ (the smallest $a$
such that $f^a = 1$), computes the discrete logarithm of $h$ with respect to
$g$. That is, the algorithm should output a number $x$ such that $g^x = h$.
See footnote for hint.²⁶


23.13 BIBLIOGRAPHICAL NOTES 


Chapters 9 and 10 in the book Quantum Computing Since Democritus
give an informal but highly informative introduction to the topics
of this lecture and much more. Shor’s and Simon’s algorithms are
also covered in Chapter 10 of my book with Arora on computational
complexity.

There are many excellent videos available online covering some
of these materials. The Fourier transform is covered in these videos of
Dr. Chris Geoscience, Clare Zhang and Vi Hart. More specifically to
quantum computing, the videos of Umesh Vazirani on the Quantum
Fourier Transform and Kelsey Houston-Edwards on Shor’s Algorithm
are highly recommended.

Chapter 10 in Avi Wigderson’s book gives a high level overview of
quantum computing. See also Andrew Childs’ lecture notes on quantum
algorithms, as well as the lecture notes of Umesh Vazirani, John Preskill,
and John Watrous.
21 Hint: You can use $U_{NAND}$ to simulate NAND gates.


22 Hint: Use the alternative characterization of P as in
??.

23 Hint: You can use the HAD gate to simulate a coin
toss.

24 Hint: In exponential time simulating quantum
computation boils down to matrix multiplication.


25 Hint: If a reduction can be implemented in P it can
be implemented in BQP as well.


26 We are given $h = g^x$ and need to recover $x$. To
do so we can compute the order of various elements
of the form $h^a g^b$. The order of such an element is
a number $c$ satisfying $c(xa + b) = 0 \pmod{|G|}$.
With a few random examples we will get a non trivial
equation on $x$ (where $c$ is not zero modulo $|G|$) and
then we can use our knowledge of $a, b, c$ to recover $x$.
Regarding quantum mechanics in general, this video illustrates
the double slit experiment, and this Scientific American video is a nice
exposition of Bell’s Theorem. This talk and panel moderated by Brian
Greene discusses some of the philosophical and technical issues
around quantum mechanics and its so-called “measurement problem”.
The Feynman lectures on the Fourier Transform and quantum
mechanics in general are very much worth reading.

The Fast Fourier Transform, used as a component in Shor’s algorithm,
is one of the most useful algorithms across many application
areas. The stories of its discovery by Gauss in trying to calculate asteroid
orbits and rediscovery by Tukey during the Cold War are fascinating
as well.


23.14 FURTHER EXPLORATIONS 


Some topics related to this chapter that might be accessible to ad- 
vanced students include: (to be completed) 